Re: Deleting variables by varnum

MichaelD · Posted 06-16-2010 12:49 PM

Ok, I have a lot variables and I need to delete more then half of them.

Unfortunately, very few are sequential (ie X1-X10) very few are grouped together (211=varnum(dataset,varnameA), 212=varnum(dataset,varnameB,... 236=varnum(dataset,varnameZ)) and then there are just a bunch all over the place.

I wrote a little macro (see end of post) that works for the cases where the variables are grouped so when I have a few separate clusters a can delete them pretty easily. Of course it can also keep lists of variables just as easily.

I'm hoping to make some changes so I can delete the variables all over the place easily.
The current macro is:

%macro VarList(DATASET,SEP,i=,n=);

proc sql;
create table VarListVars as
select varnum, name
from dictionary.columns
where memname = &DATASET.;
quit;

/* maybe this can be done inside the proc sql also*/
data VarListTemp;
set VarListVars(firstobs=&i obs=&n);
run;

%global VarList;

proc sql noprint;
select name into :VarList separated by &SEP
from VarListTemp;
quit;

/*EDIT:I guess I should add something to delete my temp datasets but I'm not too worried about that at this point*/

%mend VarList;

%VarList('dataset',' ',i=211,n=236);

Then I use
data datasetB;
set dataset;
drop/keep/(if with sep='=0 | ')/whatever &VarList;
run;

I'd like to be able to use a vector of varnums. Like change the macro to accept v={1,10,38...,all integers from 211 to 236,... 525 to 538,...} instead of i,n

Thanks Message was edited by: MichaelD

Cynthia_sas · Posted 06-16-2010 01:40 PM

Hi:
FIRSTOBS and OBS are not references to variable values or variable names. FIRSTOBS and OBS allow you to select observations or rows from the data. So, for example, if you do this:
[pre]
%let i = 15;
%let n = 17;
proc print data=sashelp.class(firstobs=&i obs=&n);
title "Firstobs=&i and obs=&n";
run;
[/pre]

Then, PROC PRINT will start selecting rows or observations with ROW 15 and will continue to select observations until the OBS= value is reached (in this case, ROW 17). This means that PROC PRINT will disply rows 15, 16 and 17 from SASHELP.CLASS.

Knowing this behavior of FIRSTOBS and OBS, I am confused about how you think your usage is going to result in the deletion of "all integers from 211 to 236" unless your choice of those same values for &I and &N was coincidental.

When you talk about deleting "integers", to me that seems like you're saying you want to delete variable VALUES, not the variables themselves.

You indicate that you are trying to delete the variables by their variable number or position in the data, but that seems an unnecessary complication or distinction. If you know that you want to delete VARNAMEA, VARNAMEB and VARNAMEZ, then you only need to list those variable names in your DROP statement (or list the variables you want to keep in your KEEP statement).

In addition, I find it hard to undestand how your PROC SQL step is working. MEMNAME in dictionary.columns is a character variable, so the usual construction is:
[pre]
where memname = "&DATASET" OR
where upcase(memname) = "%upcase(&DATASET)"
[/pre]

For some more ideas about how to use DICTIONARY.COLUMNS, you might search through previous forum postings for some of the many examples that have been previously posted here.

In addition, these papersand reference documentation might prove useful:
http://support.sas.com/documentation/cdl/en/sqlproc/62086/HTML/default/a001385596.htm
http://support.sas.com/resources/papers/proceedings09/053-2009.pdf
http://www2.sas.com/proceedings/sugi30/057-30.pdf
http://www2.sas.com/proceedings/sugi30/070-30.pdf
http://www2.sas.com/proceedings/sugi29/237-29.pdf
http://www.codecraftersinc.com/pdf/DictionaryTablesRefCard.pdf
http://www.lexjansen.com/pharmasug/2006/tutorials/tu03.pdf
http://www.qsl.net/kd6ttl/sas/sqlutilov.pdf

cynthia

MichaelD · Posted 06-16-2010 02:07 PM

Well the output of:
proc sql;
create table VarListVars as
select varnum, name
from dictionary.columns
where memname = &DATASET.;
quit;

is a 2 column dataset with variables varnum and name of the variables from &DATASET.
So observation i in the new data set VarListVar is varnum=i name='whatever the ith variable of &DATASET is named'

My problem is I have hundreds of variables and of those I have about 100-200 I need to keep. Its impractical to type out all the variable names I want to keep or delete since this will lead to errors. Its not as error prone for me to type out a vector of variable numbers I want to keep though. And that's what I'd like to change my macro to be able to do.

Since the output VarListVars is a list of all variables and their corresponding varnum I can (semi) easily scan the list for the variables I want to delete or keep

Flip · Posted 06-16-2010 02:22 PM

Since you are creating the valistvars table, why not edit that table removing the variables you want to drop, then create a macro variable containing the remaining names with an into clause then using that macro variable in a keep statement?

select name into :keepnam from varlistvars ...

data ;
set original;
keep ( &keepnam);

MichaelD · Posted 06-16-2010 02:43 PM

I actually didn't think of that, I'm a very hands off user. It definitely works. Thanks.

But I would still like to implement using a vector of variable numbers if anyone else comes up with an idea. The thing is deleting by hand is not very repeatable. Sure, I can export the dataset so I can reuse the list of variables it but if don't have access to that file I'd be stuck recreating what I need.

Or worse, I may pass on my final code (not this macro) and someone might not know how to use the files which give the VarList strings

ArtC · Posted 06-16-2010 03:05 PM

Here is a way to retrieve variable names based on variable numbers in a DATA step. The OPEN, VARNAME, and CLOSE functions adapt very well to the macro language as well.

[pre]
data varnums;
number = 1; output varnums;
number = 4; output varnums;
run;

data _null_;
length namelist $500;
dsid = open('sashelp.class');
do until (done);
set varnums end=done;
vname = varname(dsid,number);
put vname=;
namelist = catx(' ',namelist,vname);
end;
call symputx('namelist',namelist);
rc = close(dsid);
stop;
run;
%put namelist &namelist;
[/pre]

Peter_C · Posted 06-16-2010 05:35 PM

MichaelD
to make a solution "repeatable" would be quite simple --- if --- you have a simple rule for inclusion or exclusion of the columns
--- or you have a data set model (having just the columns you want on your final table)

Are you in either of these situations?

PeterC

data_null__ · Posted 06-16-2010 04:48 PM

I believe the IN operator provide the functionally you seek.

[pre] if varnum in(1 3 4:12 50:60 80:100 19); [/pre]

[pre]
*** Create data;
proc plan seed=477976307;
factors name = 100 prefix=1 of 4 / noprint;
output out=sample prefix cvals=('A' 'B' 'C' 'D');
run;
data sample;
set sample;
length _name_ $32;
_name_ = cats(prefix,name);
run;
proc transpose data=sample out=sample(drop=_:);
var _name_;
id _name_;
run;

*** var num list;
%let dataset=sample;
proc sql;
create table VarListVars as
select varnum, name
from dictionary.columns
where libname eq 'WORK' and memname = "%upcase(&DATASET.)";
quit;
run;

*** use this data set to create a list of variables to drop or keep;
data drop;
set varListVars;
if varnum in(1 3 4:12 50:60 80:100 19);
run;
proc print;
run;
[/pre]

MichaelD · Posted 06-16-2010 06:00 PM

data _null_;: I believe that IS what I am looking for

PeterC: I'm not sure what you mean by a 'simple rule'. Its definitely not as easy as every 3rd variable or variables that have 'c' as their 4th letter (though if you know how to do that it may be useful someday).

The ones I want almost seem to be randomly distributed and similar variables, maybe city and state, are no where near each other. I don't know.

As for a data model of the variables I want... that's what I'm building. I don't want to count on the variables being in the same order every time, but I hope 'they' won't rename variables too often.

Flip · Posted 06-17-2010 07:58 AM

"I don't want to count on the variables being in the same order every time"

In this case you should stay away from using the variable number. You might be deleting different variables each time.

What peter means is if there is a logical way to choose names such as:

Select name .........
WHERE (some name pattern);

MichaelD · Posted 06-17-2010 08:50 AM

Flip, I think you've misunderstood me.

Say I have a dataset of word counts from today's New York Times. Maybe it was built sequentially so that variables 1-x are all in the front page article, observation one.

Now there's no problem for today since all variables will be present for all articles and in the same order. But tomorrow the sequence of the variables/words will change.

So if I want to do text mining I will export the list of 'keep' variables from today and import that all other days, making revisions as necessary, rather then rebuild my lexicon each day.

Now, this is NOT what I'm doing but I think its a clear example. I would, however, find it useful to know how to keep, for example, every variable/word ending in 'ed' or every variable with 'c' as the third letter. I think those fall under "some name pattern".

Thanks for the help.

Flip · Posted 06-17-2010 08:56 AM

This sort of approach should do that.

proc sql;
select name into :keeplst separated by ' ' from dictionary.columns
where libname = 'XXXX' and memname = 'YYYY'
and (upcase(name) LIKE '%ED') or index(upcase(name), 'C') = 3);
end;

data new;
set old (keep = &keeplst);
...
run;

Peter_C · Posted 06-17-2010 09:24 AM

name patterns can be tested from a list of the column names. That can be obtained from the OUT= data set from proc contents, from a transpose of all columns of the dataset(but with obs=0), or from an sql query on sashelp.vcolumn (remembering to reduce the rows considered with a where clause on libname and memname). The first of these 3 would be my general recommendation
proc contents data=your.dataset noprint out= your_dataset_columns; run;
That can be filtered on "front" or "back" of the name columns and the selected list stored in a macro variable with fairly clear sql code, like:[pre]%let ed_names = !none! ; * a default designed to look impossible ;
proc sql noprint ;
select name into :ed_names separated by ' '
from your_dataset_columns
where lowcase(substr( name, length(name)-1, 2)) ='ed'
;
%put &sqlobs found, named &ed_names ;
quit ;[/pre]As column names might be mixed case, lowcase(). The length() function indicates the last non-blank character of the column name, so to extract the columns which end in the substr "ed" (in either case) I offered that where clause.

If you follow up these features in the online doc, you'll be ready to do most of these types of "name-feature" selections.
More advanced selection is within the domain of Perl regular expression filtering, which you can find by searching the doc for "prxparse".

hope this helps.

peterC

data_null__ · Posted 06-17-2010 09:28 AM

From the description of the your data my impression is that you are using the wrong data structure. Plus you want to do operations on the meta data that are more suited, that is more easily done, on data. I would transpose so that I had something like.

[pre]
id page word count
NYT 1 of 68
NYT 1 the 77
[/pre]

It would be helpful to me to see and example of the data you are working with.

Peter_C · Posted 06-17-2010 08:55 AM

as often as variable order and name clarifies anything, when in "reduce the columns mode", I find often find stats on variable values more useful. MIN and MAX being integers indicate columns which are probably integer in their origination (and finding integers in the percentile stats generally confirms the point for me). A small-ish number of distinct numeric values is indicative of categorical data. And obviously, the columns which are constant and/or empty offer scope for "column reduction".
It is in these areas I would look for the (sometimes simple) guiding rules for removing (or changing) columns.

peterC

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away