BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
sumsar
Calcite | Level 5

Hi everyone, 

 

I have created a macro which permits to modify the format of all data in all the datasets of my librairy at once, in the way you can see in the file attached.

My macro works well, but it takes a very long time to process (sometimes up to 10 minutes!).

Could you try to find a way to optimize it please? You'll find the code below :

 

*We put all the datasets in a list and we get the number of datasets;

proc sql;

select distinct memname into :liste separated by " " from sashelp.vcolumn where libname="XXX ";

select count (distinct memname) into :nb from sashelp.vcolumn where libname="XXX";

 

 

*We replace the var column by the value of the var_decode column and we delete the var_decode column;

option FMTSEARCH=(XXX.formats);

 

%macro content;

%do i=1 %to &nb;

proc contents data=XXX.%scan(&liste,&i)  out=contents&i noprint;run ;

 

data data2&i ;

length name $32. ;

set contents&i (keep=name rename=(name=label));

where index (upcase(label), '_DECODE') gt 0 ;

name=substr (label, 1, index (upcase(label), '_DECODE') -1) ;

run ;

 

proc sort data=contents&i; by name; run;

proc sort data=data2&i; by name; run;

 

data data3&i ;

merge contents&i (in=A) data2&i (in=B);

by name ;

if B;

run ;

/* We create formats for numeric data*/

data _null_ ;

  set data3&i (where=(type=1) ) ;

  call execute (

"proc sort data=XXX.%scan(&liste,&i) out=data5&i (keep=" !! name !! " " !! label !! ") nodupkey;

      by " !! label !! " ;

run ;

data data6&i ( drop= " !! name !! " rename=(" !! label !!"=label));

      retain type 'n' fmtname " !! quote (label) !! ";

      retain start 0 end 0 ;

        set data5&i ;

        by " !! label !! " ;

        if first." !! label !! " then start=" !! name !! ";

        if last." !! label !! " then do ;

           end=" !! name !! " ;

             output ;

        end ;

run ;

Proc format library=XXX.formats cntlin=data6&i ;

run ; ");

run ;

 

proc sql;

select name into :var_list&i separated by " " from contents&i where index (upcase(name), '_DECODE') gt 0;

select count(*) into :nbr&i from contents&i where index (upcase(name), '_DECODE') gt 0;

quit;

 

%put &&var_list&i;

%put &&nbr&i;

 

%do j=1 %to &&nbr&i;

 

%let pos1= %index(%scan(&&var_list&i,&j),_decode);

%let pos2=%eval(&pos1 - 1);

 

data XXX.%scan(&liste,&i);

set XXX.%scan(&liste,&i);

format %substr(%scan(&&var_list&i,&j),1,&pos2) %scan(&&var_list&i,&j). ;

put _all_ ;

drop %scan(&&var_list&i,&j);

 

run;

%end;

%end;

%mend;

 

%content;

 

 

Thanks in advance.

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User
data XXX.%scan(&liste,&i);

set XXX.%scan(&liste,&i);

format %substr(%scan(&&var_list&i,&j),1,&pos2) %scan(&&var_list&i,&j). ;

put _all_ ;

drop %scan(&&var_list&i,&j);

 

run;

%end;

The above step where you apply the formats is problematic because you're essentially reading and recreating the data set. Instead use PROC DATASETS to update the formats without recreating the data set. That will save you a significant amount of time.

 

If it still takes too long, run it a few times with msglevel=i and look at the timing to see which step is taking the longest and then work on improving/reducing the speed of that step.

 


@sumsar wrote:

Hi everyone, 

 

I have created a macro which permits to modify the format of all data in all the datasets of my librairy at once, in the way you can see in the file attached.

My macro works well, but it takes a very long time to process (sometimes up to 10 minutes!).

Could you try to find a way to optimize it please? You'll find the code below :

 

*We put all the datasets in a list and we get the number of datasets;

proc sql;

select distinct memname into :liste separated by " " from sashelp.vcolumn where libname="XXX ";

select count (distinct memname) into :nb from sashelp.vcolumn where libname="XXX";

 

 

*We replace the var column by the value of the var_decode column and we delete the var_decode column;

option FMTSEARCH=(XXX.formats);

 

%macro content;

%do i=1 %to &nb;

proc contents data=XXX.%scan(&liste,&i)  out=contents&i noprint;run ;

 

data data2&i ;

length name $32. ;

set contents&i (keep=name rename=(name=label));

where index (upcase(label), '_DECODE') gt 0 ;

name=substr (label, 1, index (upcase(label), '_DECODE') -1) ;

run ;

 

proc sort data=contents&i; by name; run;

proc sort data=data2&i; by name; run;

 

data data3&i ;

merge contents&i (in=A) data2&i (in=B);

by name ;

if B;

run ;

/* We create formats for numeric data*/

data _null_ ;

  set data3&i (where=(type=1) ) ;

  call execute (

"proc sort data=XXX.%scan(&liste,&i) out=data5&i (keep=" !! name !! " " !! label !! ") nodupkey;

      by " !! label !! " ;

run ;

data data6&i ( drop= " !! name !! " rename=(" !! label !!"=label));

      retain type 'n' fmtname " !! quote (label) !! ";

      retain start 0 end 0 ;

        set data5&i ;

        by " !! label !! " ;

        if first." !! label !! " then start=" !! name !! ";

        if last." !! label !! " then do ;

           end=" !! name !! " ;

             output ;

        end ;

run ;

Proc format library=XXX.formats cntlin=data6&i ;

run ; ");

run ;

 

proc sql;

select name into :var_list&i separated by " " from contents&i where index (upcase(name), '_DECODE') gt 0;

select count(*) into :nbr&i from contents&i where index (upcase(name), '_DECODE') gt 0;

quit;

 

%put &&var_list&i;

%put &&nbr&i;

 

%do j=1 %to &&nbr&i;

 

%let pos1= %index(%scan(&&var_list&i,&j),_decode);

%let pos2=%eval(&pos1 - 1);

 

data XXX.%scan(&liste,&i);

set XXX.%scan(&liste,&i);

format %substr(%scan(&&var_list&i,&j),1,&pos2) %scan(&&var_list&i,&j). ;

put _all_ ;

drop %scan(&&var_list&i,&j);

 

run;

%end;

%end;

%mend;

 

%content;

 

 

Thanks in advance.

 

 


 

View solution in original post

16 REPLIES 16
Patrick
Opal | Level 21

@sumsar 

There is logic in your code which I don't understand and if this is only about exchanging one permanent format with another - and there is not enough comment for me to understand what you're trying to achieve in detail.

 

If this is really only about exchanging one named format with another named format then the code could be much simpler.

 

And in regards of the current runtimes:

SAS Formats are on descriptor level and you don't need to process the data to change the format. Use Proc Datasets / Modify or SQL Alter Table instead of a SAS data step which not only processes all the data and re-creates the table but which also leads to potentially drop all other table attributes like indexes and constraints.

sumsar
Calcite | Level 5

Hi Patrick,

 

Thank you for your answer.

Do you see the file attached to my post? I think it could help you to see what I have initially and what I want to get with the macro.

 

 

Best regards

Patrick
Opal | Level 21

@sumsar 

I've looked into your attached doc but it didn't really answer my questions. I believe my thinking is very much in line with@Tom and @Astounding

 

What would help a lot is if you could explain what you really have to start with and what you need in the end. I believe such explanation could lead to a simpler and better performing approach than what you currently have.

 

Just saw that you replied. I'll post now anyway and then read your answer. 

 

...read your reply; that's not what I need to understand the problem.

sumsar
Calcite | Level 5

Thank you Patrick.

To describe exactmy what I try to do : I have a database which contains 25 datasets. In all my datasets, my coded variables (for example : Yes=1;0=No) are split in two columns : one in numeric format which contains 1 or 0 and a second one which contains the decode in text format (Yes or No). Note that the second column is always called with the name of the frist one, concatened with (_decode). 

For example as in the file attached I have a column var in numeric format and a column var_decode in text format.

My objective is to create, instead of these two columns, a unique column called var on which the following format (called var_decode. in my code) would be applied (Yes=1;No=0);

And I want only one macro to do that on all my coded variables in all my datasets at once and to replace my initial datasets (not to create new ones).

 

Best.

 

Astounding
PROC Star

Let me suggest an alternative approach.  Based on the idea that you are already familiar with the CNTLIN= option on PROC FORMAT ...

 

Save all your format definitions as a permanent SAS data set, suitable for use as a CNTLIN= data set.

 

Use PROC FORMAT to save your formats permanently, based on the contents of that SAS data set.

 

When you want to change the format definitions, change the permanent SAS data set that holds the format definitions, and re-run PROC FORMAT with CNTLIN=.

 

Maintain some sort of history of the permanent SAS data set just in case you need to go back to an earlier set of format definitions.

Tom
Super User Tom
Super User

Can you explain what the purpose of the code is in more detail?  You state:

modify the format of all data in all the datasets of my librairy at once

But don't explain what that means before launching into a complicated piece of code.

 

What are your inputs? What are your outputs?

Are you trying to define formats from data?

Are you trying to change which formats are attached to data?

Are you trying to decode formatted data into new text variables?

sumsar
Calcite | Level 5

Hi Tom,

 

Thank you for your answer.

Did you see the file attached to my post? I think it could help you to see what are my inputs and outputs.

Tell me if you need more explanation.

 

Best regards

Tom
Super User Tom
Super User

Are you talking about the diagram in the WORD file?  I am not sure why you attached pictures in a file instead of posting the pictures into your question, but I did look at them and they just helped me come up with the list of questions I posted. 

 

The pictures seem to be something that you were using to explain to yourself how SAS formats work and not anything specific about what your code is trying to accomplish.

 

sumsar
Calcite | Level 5

To describe exactly what I try to do : I have a database which contains 25 datasets. In all my datasets, my coded variables (for example : Yes=1;0=No) are split in two columns : one in numeric format which contains 1 or 0 and a second one which contains the decode in text format (Yes or No). Note that the second column is always called with the name of the frist one, concatened with (_decode). 

For example as in the file attached I have a column var in numeric format and a column var_decode in text format.

My objective is to create, instead of these two columns, a unique column called var on which the following format (called var_decode. in my code) would be applied (Yes=1;No=0);

And I want only one macro to do that on all my coded variables in all my datasets at once and to replace my initial datasets (not to create new ones).

 

Best.

Tom
Super User Tom
Super User

My interpretation is that your goal is to reduce the size of your datasets by eliminated the DECODE variables and instead keeping only the matching CODE variable and attaching a user define format that can be used to generate the DECODE value from the CODE value.

 

So step one is to get the NAMES of the variables in your dataset and find the PAIRS for var/var_decode.  This should be fast as it is only detailing with metadata.

 

Step two is use that list to generate code that can find the unique values of VAR and VAR_DECODE (for each pair).  That is the step that will take time as it will have to actually read the data at least once (and possible once for each pair).

 

Then use that data to define the new formats.  So if you pull it out in the format of a CNTLIN dataset for PROC FORMAT this is just a matter of running PROC FORMAT.  It should be fast, but speed will depend on the number of unique combinations found.

 

Then use the list of pairs to define code to modify that datasets to remove the DECODE variable and attach the format to the CODE variable.  This step will also take a long time as it will require you to remake the data, although the result should be smaller datasets than you started with.

 

Complications could happen:

  • How to you want to name the new formats?  Do you use the name of the CODE variable?  What if it ends with a number? Names of formats cannot end with a number.
  • If the same CODE/DECODE pairs appears in more than one source dataset.  What do you want to do then?
    • Do you want to make separate formats for each of them?  How do you want to name them?
    • Do you want to combine the data from the multiple datasets into one format? What if the decodes for the same code don't agree?
  • What if many variables actual are using the same format? Like the 1=Yes 0=No example you gave? 
    • Do you want to create that same format multiple times with different names?  As if you were using SPSS instead of SAS?
    • Or do you want to detect that they are the same format and create it just once and attach it to all of the appropriate variables? If so then how to generate a name for it?

 

sumsar
Calcite | Level 5

Thank you for your answer.

My answers below :

 

Complications could happen:

  • How to you want to name the new formats?  Do you use the name of the CODE variable?  What if it ends with a number? Names of formats cannot end with a number.

         In my current macro, I have called the format var_decode.

 

  • If the same CODE/DECODE pairs appears in more than one source dataset.  What do you want to do then?
    • Do you want to make separate formats for each of them?  How do you want to name them
    • Separate formats for each of them, always called var_decode.
    •  
    • Do you want to combine the data from the multiple datasets into one format? What if the decodes for the same code don't agree?
    • No I don't want to combine the data
  • What if many variables actual are using the same format? Like the 1=Yes 0=No example you gave? 
    • Do you want to create that same format multiple times with different names?  As if you were using SPSS instead of SAS?
    • I want to create the same format multiple times with different names
    • Or do you want to detect that they are the same format and create it just once and attach it to all of the appropriate variables? If so then how to generate a name for it?
    • No I want to create the same format multiple times with different names

I remind that the code that I have presented in my initial post works perfectly, but takes a long time to process. I would need help to optimize it, so if you think that my answers to your questions are not the best way to process, don't hesitate to tell me.

Tom
Super User Tom
Super User

Are you writing the formats for each dataset into a separate catalog? Otherwise you cannot create multiple SEX_DECODE formats. Instead the last dataset processed that has SEX/SEX_DECODE pair will define the meaning of the format.

Tom
Super User Tom
Super User

Run your macro with MPRINT and FULLSTIMER options on and figure out which steps are the ones taking time.

Then post the code and explanation for those steps and we can see if there is a better way to do those steps.

 

In general the best way is to not have to do this exercise at all.  Either just live with the two variable format. Or better define the formats before the original datasets are created and keep that definition for your use later.

Tom
Super User Tom
Super User

You can do a little to make it more efficient, but not much.

It looks like you are using PROC SORT to generate the data needed to define the formats.  I cannot think of a faster way to do that.  You might be able to do all of the pairs for a single dataset in one pass with a complicated data step.  But check how much time those steps are taking.

 

It looks like you are trying to generate proc format CODE instead of a CNTLIN dataset. So changing that step should make the macro simpler, but it is not clear how much faster it will be.

 

But the step where you are dropping/formatting the variables is where you could get a performance increase by doing all of the paris for a dataset in a single pass.  Right now you have to loop in the wrong place so that you are rebuilding the dataset once for each pair found in that dataset.  Instead have it generate code that does it all in one step. 

 

You could keep your macro variable "arrays" and just move the %DO loops inside the definition of the DATA step. Or better you could just generate the lists needed into macro variables instead.  So if you generated two variable named for example FMTLIST and DROPLIST then your code will look like:

data XXX.%scan(&liste,&i);
  set XXX.%scan(&liste,&i);
  format &fmtlist ;
  drop &droplist;
run;

I leave it as an exercise for you to figure out how to generate those macro variables from the data you have.  The syntax for the values of FMTlist will be varname=formatspec.  So result will look something like 

format race=race_decode. sex=sex_decode. ;
drop race_decode sex_decode ;

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 16 replies
  • 2096 views
  • 3 likes
  • 5 in conversation