Re: List of variables based on suffix (not prefix) - Page 2

Tom · Posted 07-27-2023 02:24 PM

Create a dataset with the list of variables and then you can create categories for them.

Let's keep it simple and just use PROC CONTENTS to get the variable names.

proc contents data=mylib.mydata out=contents noprint; run;

Now you can use the name to figure out the groups. Perhaps you want to pull out a 1 character prefix and a 3 character suffix?

data groups;
  set contents;
  prefix = substr(name,1,1);
  suffix= substr(name,length(name)-2);
  middle = substrn(name,2,length(name)-4);
run;

Now you can query the list more easily;

proc sql noprint;
select nliteral(name) 
  into :namelist separated by ' '
  from groups
  where prefix in ('A','C')
    and suffix in ('TDY')
;
quit;

_maldini_ · Posted 07-27-2023 03:35 PM

Yikes, you're asking for a lot more questions w/ that idea!

Before I try to adapt that code to play with it, what output will querying the list using PROC SQL produce?

Quentin · Posted 07-27-2023 04:41 PM

@_maldini_ wrote:

Yikes, you're asking for a lot more questions w/ that idea!

Before I try to adapt that code to play with it, what output will querying the list using PROC SQL produce?

Always a good idea to try it.

The code is very similar to Paige's code. It's using PROC SQL generate a macro variable that holds a list of variable names. Paige's code uses a dictionary table as the source for variable names in a table, and Tom's code uses PROC CONTENTS to output a dataset with the variable names.

Note also Tom's suggestion to use the DATA step to parse the variable names into their logical components. This way you could use PROC SQL to create a list of all variables from timepoint='A', or all variables with variableName='CGTDY'.

A third alternative would be to use a data dictionary / codebook as a datasource. Even when survey developers make poor decisions like creating a data format with thousands of variables, they usually will document the data. So if you have access to a data dictionary that lists all of the variables, with the timepoint, label, etc, often that dictionary can be imported into a dataset and used as a source of helpful metadata.

The Boston Area SAS Users Group is hosting free webinars!

Register now at https://www.basug.org/events.

_maldini_ · Posted 07-30-2023 06:18 PM

I'm doing my best to figure this out. It's confusing for me and I appreciate all your help. I have a partial solution (below) that utilizes the approach provided by @PaigeMiller.

proc sql noprint;
    select name into :colnames separated by ' ' /*  Selects the name variable, which would typically contain column names.;  */
/*  It then places the selected names into a macro variable named colnames, with each name separated by a space.; */
    from dictionary.columns
/*  This line specifies the source of the data, which is the dictionary.columns table in SAS.  */
/*  This is a special table that contains metadata about all available datasets, including column names. */
    where 
    
/* 	Age */
	substr(name,4) eq 'AGE' or
/* 	Education */
	substr(name,4) eq 'ED' or
	
/* 	BMI */
	substr(name,4) eq 'BMI' or
/* 	Note: Neither BMI nor height nor weight are available in the data I have. */

/* 	Exam dates */
/* 	Not including exam date 35 (see below as J02EXDAT) */
	substr(name,2) eq '02DATE' or
	
/*  Smoking status */
    substr(name,2) eq '10SMOKE'
       
        
    and libname='CARDIA' /* Whatever libname is needed goes here, upper case */
    and memname='C1260REQ06_29_2023' /* Your data set name goes here, upper case */
	;
quit;

/***********************************************************************************************************/
/* Creating a subset. */
/* &colnames (from above) is included in the KEEP statement. */
/***********************************************************************************************************/
data cardia.go;
    set CARDIA.C1260REQ06_29_2023;
    keep 
    ID
    SHORT_ID
	SEX
	race

/*  Suffix Variables  */
    &colnames

/*  Preffix Variables  */    
	CENTER:    
	
/* 	Per Lucia: The exam at Y35 was carried out in two phases. We are still discussing the best way to assemble a Reference File for this exam.  */
/* 	We have variable J02EXDAT. */
	J02EXDAT
	
/* 	TAKING DIABETES MEDS (1=NO,2=YES) at baseline */
	FDIABMED
    ;

    label J02EXDAT="DATE OF EXAM 10";
run;

The above syntax allows me to use a list based on a suffix in a KEEP statement to create a subset (per my original post).

I would also like to be able to subset by prefix (e.g., subset all the variables from a particular timepoint; "A" = baseline). @Tom offered an approach that looks like it might offer that, but I can't figure out how to go from the PROC SQL to a dataset. When I run a PROC PRINT on the "groups" dataset the variables are all under the column "names" and the other columns are all the metadata categories:

Using what I've adapted from @PaigeMiller above (because at least I understand that part), Is there a way to create a variable that would allow me to subset by prefix?

data cardia.go_a;
  set cardia.go;
  where prefix = "A";
run;

Thanks for your help.

PaigeMiller · Posted 07-30-2023 06:54 PM

I strongly agree with the others (@Patrick , @Quentin and @ballardw and anyone else I missed) who have stated that this very wide data set would be much easier to handle if it was narrow and long. If I were you, @_maldini_ I would bite the bullet and do the work to convert the data set to narrow and long.

Using what I've adapted from @PaigeMiller above (because at least I understand that part), Is there a way to create a variable that would allow me to subset by prefix?

data cardia.go_a;
  set cardia.go;
  where prefix = "A";
run;

Thanks for your help.

WHERE allows you to select ROWS, the entire conversation has been about selecting COLUMNS.

But you could use the PROC CONTENTS output or DICTIONARY.COLUMNS table to find variable names that BEGIN with A, its really analogous to what I showed about finding names that END with a certain prefix.

--
Paige Miller

_maldini_ · Posted 07-30-2023 08:00 PM

I'm certainly willing to transform the dataset based on this guidance (assuming I can figure it out w/ all these variables). Wouldn't it be easier if I first subset the dataset to something more manageable?

Regarding my original objective(s): It looks like the synxtax below does what I want it to do. I'm sure there are better ways to do this, but in the spirit of @Reeza ("There's probably more efficient methods, but if this is one you understand and can edit/update yourself it's the easiest")...

Any chance you could take a quick look at the syntax below and let me know if you think anything looks objectively wrong?

proc sql noprint;
    select name into :colnames separated by ' ' /*  Selects the name variable, which would typically contain column names.;  */
/*  It then places the selected names into a macro variable named colnames, with each name separated by a space.; */
    from dictionary.columns
/*  This line specifies the source of the data, which is the dictionary.columns table in SAS.  */
/*  This is a special table that contains metadata about all available datasets, including column names. */
    where 
    
/* 	Age */
	substr(name,4) eq 'AGE' or
/* 	Education */
	substr(name,4) eq 'ED' or
	
/* 	BMI */
	substr(name,4) eq 'BMI' or
/* 	Note: Neither BMI nor height nor weight are available in the data I have. */

/* 	Exam dates */
/* 	Not including exam date 35 (see below as J02EXDAT) */
	substr(name,2) eq '02DATE' or
	
/*  Smoking status */
    substr(name,2) eq '10SMOKE'
       
        
    and libname='CARDIA' /* Whatever libname is needed goes here, upper case */
    and memname='C1260REQ06_29_2023' /* Your data set name goes here, upper case */
	;
quit;

/***********************************************************************************************************/
/* Creating a subset. */
/* &colnames (from above) is included in the KEEP statement. */
/***********************************************************************************************************/
data cardia.go;
    set CARDIA.C1260REQ06_29_2023;
    keep 
    ID
    SHORT_ID
	SEX
	race

/*  Suffix Variables  */
    &colnames

/*  Preffix Variables  */    
	CENTER:    
	
/* 	Per Lucia: The exam at Y35 was carried out in two phases. We are still discussing the best way to assemble a Reference File for this exam.  */
/* 	We have variable J02EXDAT. */
	J02EXDAT
	
/* 	TAKING DIABETES MEDS (1=NO,2=YES) at baseline */
	FDIABMED
    ;

    label J02EXDAT="DATE OF EXAM 10";
run;

/* proc contents data=cardia.go varnum; run;  */
/* proc print data=cardia.go (obs=10); run;  */

/* ********************************************************************************************************* */
/* The syntax below allows for subsetting based on a prefix (i.e., exam year) */
/* ********************************************************************************************************* */
/* To subset for exam year 35: "J" */

proc sql noprint;
    select name into :colnames_j separated by ' ' 
    from dictionary.columns
    where upcase(substr(name,1,1)) eq 'J' /* Your suffix goes here */
	    and libname='CARDIA' /* Whatever libname is needed goes here, upper case */
	    and memname='go' /* Your data set name goes here, upper case */
;
quit;


data CARDIA.go_j;
    set CARDIA.go;
    keep 
    ID
    SHORT_ID
	SEX
	race
	center30

/*  Suffix Variables  */
    &colnames_j
  
	J02EXDAT
    ;
run;

proc contents data=cardia.go_j varnum; run; 
proc print data=cardia.go_j (obs=10); run;

Thanks for your time and patience w/ me.

Tom · Posted 07-31-2023 10:26 AM

Looks fine, but you appear to be missing ( ) in your WHERE clause.

If you code an expression like

A or B or C and X and Y

you need () to make sure that it does:

(A or B or C) and X and Y

And not

A or B or (C and X and Y)

If the prefixes on the names are varying length then you will need to use LIKE operator

name like '%BMI'

instead of SUBSTR()

substr(name,4) eq 'BMI'

to find the matching suffixes.

Also note that your NAME tests are assuming the NAME values are in uppercase letters. Unlike the LIBNAME and MEMNAME variables the NAME variable values can be in mixed case.

Tom · Posted 07-31-2023 11:55 AM

You seem to have two types of suffixes.

Those that start in column 2

substr(name,2) eq '02DATE'
substr(name,2) eq '10SMOKE'

And those that start in column 4

substr(name,4) eq 'AGE'
substr(name,4) eq 'ED'
substr(name,4) eq 'BMI'

You could just generate a macro variable for each suffix. Probably easiest to just do it once and then reuse them over and over.

%let _01bmi=A01AGE B01AGE C01AGE .... ;
%let _02race = A02RACE B02RACE C02RACE ..;

You could use a simple program to generate those %LET statements for a particular file.

proc contents data=cardia.C1260REQ06_29_2023 noprint out=contents;
run;

data groups;
  set contents;
  suffix=substr(name,2);
  macname=suffix;
  if not nvalid(macname,'v7') then macname='_'||macname;
run;
proc sort data=groups; by macname ; run;
data _null_;
  file 'var_groups.sas' column=cc ;
  set groups;
  by macname;
  if first.macname then put '%let ' macname '= ' @ ;
  if 75 < (cc + length(name)) then do;
    put / '  ' @ ;
    split=1;
  end;
  put name @;
  if last.macname then do;
     if split then put ;
     put ';' ;
  end;
run;

So that you get a file like:

%let _01bmi=A01AGE B01AGE C01AGE .... ;
%let _02race = A02RACE B02RACE C02RACE ..;

ballardw · Posted 07-27-2023 04:08 PM

@_maldini_ wrote:

Apologies. I want to subset an enormous dataset w/ 4962 variables using a KEEP statement. The data are derived from a longitudinal study w/ multiple time points. Variables that begin with "A" represent time point 1. Variables that begin with "B" represent time point 2. Etc.

A good example of a decision made long ago that was wrong. The data for most purposes should have the same variables but one that holds the "time point" information and one observation per timepoint per respondent.

As a changing part based on a source then the "timepoint" should be the suffix, not the actual variable if the main use is going to be looking at the topics and not the timepoints for analysis.

One very common thing with longitudinal data is determining which respondents were available for all times, or when they dropped out (last period of participation) or intermittent responses. This wide format is going to make it very hard to answer those especially if your response values are not able to differentiate between "no value because the participant wasn't reached" and "no value".

Seeing 4000+ variables alone quite often points to a sub-optimal data structure.

Quentin · Posted 07-27-2023 04:20 PM

This sort of extremely wide dataset becomes problematic to work with very quickly. You've got longitudinal data strung out with one record per patient, and the same question (number of cigarettes smoked today) was asked over nine time-points, so you have nine variable: A09CGTDY to I09CGTDY. (If A-I is the timepoint, what is the meaning of the 09? ). Then presumably you have another question for number of alcoholic drinks, and it would also have 9 variables A09ALTDY to I09ALTDY.

When working with data like this, usually the best first step is to transpose it into a more narrow format. So instead of one patient having 1 row of data with 19 columns (PatientID A09CGTDY-I09CGTDY A09ALTDY-I09ALTDY), one patient has 9 rows of data (one per timepoint) with the columns PatientID Timepoint CGTDY ALTDY.

As you look at your survey data, you may find many other examples of repeating questions which could be further broken out and separated into their own datasets.

We've all been in the position of receiving data in an undesirable format and having to work with it. Life is usually easier if you put in some time to reshape the data before you start analyzing it.

The Boston Area SAS Users Group is hosting free webinars!

Register now at https://www.basug.org/events.

Kurt_Bremser · Posted 07-31-2023 10:37 AM

@_maldini_ wrote:

Apologies. I want to subset an enormous dataset w/ 4962 variables using a KEEP statement. The data are derived from a longitudinal study w/ multiple time points. Variables that begin with "A" represent time point 1. Variables that begin with "B" represent time point 2. Etc.

The list of variables I used in my original question was 1 variable measured at multiple time points (A09CGTDY to I09CGTDY). I want to include all time points for this variable in the subset.

I want to use the KEEP statement to reference multiple lists. This list is just for one variable. I want to do this for 25+ variables.

Would I create a separate proc sql for each variable w/ its individual suffix and character count?

Sorry for not being more clear in the beginning.

From your description, it becomes obvious that you have data (at least the timepoints) stored in structure (variable names), which is always a BAD IDEA. Transpose to a long layout first; as soon as data IS data, handling it becomes MUCH easier.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Patrick · Posted 07-27-2023 11:06 PM

Often having a narrow and not a wide table makes things easier. But let's assume that the data you've got and need to keep below how you could create your keep list.

1. Create some sample data

data work.have;
  set sashelp.class;
  length A09CGTDY I09CGTDY 8;
run;

2. Select all variables where the name ends with DY (two possible expressions provided).

%let keeplist=;
proc sql noprint;
  select name into :keeplist separated by ' '
  from dictionary.columns
  where 
    libname='WORK' and memname='HAVE' 
    and find(name,'DY','it',length(name)-1)
/*    and prxmatch('/DY$/oi',strip(name))*/
    ;
quit;
%put &=keeplist;

3. Subset the data using &keeplist

data want1;
  set have(keep=&keeplist);
run;

IF the variables you want to keep are in sequence in your table then you can also use double dash notation in your keep option/statement. Proc Contents will show you the order.

proc contents data=work.have order=varnum;
run;quit;

Let's say you want to keep Name and all variables between Height and I09CGTDY then below keep notation would work.

data want2;
  set have(keep=name height--I09CGTDY);
run;

Catch up on SAS Innovate 2026