I have a date set with oldvar comprised of a few thousand alphanumeric values. Big chunks of these values (60-100) represent respective organizations. I want to create newvar with a value based on the value of oldvar.
In a similar situation with only a few values for a variable I have done the below.
data want;
set have;
If oldvar in ("123abc" "abc234" "789ghi") then newvar=1;
else if oldvar in ("222xyz" "444qrs" "432opm") then newvar=2;
run;
In this case with a few thousand values I am trying to paste from an excel file, what is the most efficient way to do this without having to put quotation marks around every single value? Thank you.
@sasgorilla wrote:
Thank you, very helpful. I needed one more step and had one additional question.
Step: Sort by oldvar
*I received an error message because I hadn't sorted by the "by" variable in each data set. This was an easy fix by sorting each of the data sets by oldvar.
Question:
When I used your code with merge, the number of observations increased. Does this mean that there must be some "oldvar" in the "company_groups" that were not in the "have" dataset? Is there an easy way to test for this if many observations in the "have" dataset have a blank or missing value for oldvar?
Yes. To use MERGE the datasets must be sorted (or have an index that can be used to retrieve the values) in the order specified in the BY statement. You could use an SQL join instead, in which case SQL will sort the data for you.
If the number of observations increases there are two possible causes. Like you said it could be that some extra observations were added. But the subsetting IF in my code eliminates those.
data want;
merge have(in=in1) company_groups;
by oldvar;
if in1;
run;
The other reason would be that you have multiple observations with the same value of OLDVAR. Either just plain duplicate observations. Or OLDVAR values that are mapped to more than one NEWVAR value.
You could fix that by first eliminating the duplicates using PROC SORT.
proc sort data=company_groups nodupkey;
by oldvar;
run;
Leave the list in DATA, not CODE.
So make a dataset that has OLDVAR and NEWVAR
data company_groups;
input oldvar $ newvar ;
cards;
123abc 1
abc234 1
789ghi 1
222xyz 2
444qrs 2
432opm 2
;
then merge it with HAVE.
data want;
merge have(in=in1) company_groups;
by oldvar;
if in1;
run;
Thank you, very helpful. I needed one more step and had one additional question.
Step: Sort by oldvar
*I received an error message because I hadn't sorted by the "by" variable in each data set. This was an easy fix by sorting each of the data sets by oldvar.
Question:
When I used your code with merge, the number of observations increased. Does this mean that there must be some "oldvar" in the "company_groups" that were not in the "have" dataset? Is there an easy way to test for this if many observations in the "have" dataset have a blank or missing value for oldvar?
@sasgorilla wrote:
Thank you, very helpful. I needed one more step and had one additional question.
Step: Sort by oldvar
*I received an error message because I hadn't sorted by the "by" variable in each data set. This was an easy fix by sorting each of the data sets by oldvar.
Question:
When I used your code with merge, the number of observations increased. Does this mean that there must be some "oldvar" in the "company_groups" that were not in the "have" dataset? Is there an easy way to test for this if many observations in the "have" dataset have a blank or missing value for oldvar?
Yes. To use MERGE the datasets must be sorted (or have an index that can be used to retrieve the values) in the order specified in the BY statement. You could use an SQL join instead, in which case SQL will sort the data for you.
If the number of observations increases there are two possible causes. Like you said it could be that some extra observations were added. But the subsetting IF in my code eliminates those.
data want;
merge have(in=in1) company_groups;
by oldvar;
if in1;
run;
The other reason would be that you have multiple observations with the same value of OLDVAR. Either just plain duplicate observations. Or OLDVAR values that are mapped to more than one NEWVAR value.
You could fix that by first eliminating the duplicates using PROC SORT.
proc sort data=company_groups nodupkey;
by oldvar;
run;
You could make a informat by proc format or Hash Table.
data company_groups;
input start $ label ;
retain fmtname 'fmt' type 'i' ;
cards;
123abc 1
abc234 1
789ghi 1
222xyz 2
444qrs 2
432opm 2
;
proc format cntlin=company_groups;
run;
data have;
input oldvar $ ;
cards;
123abc
abc234
789ghi
222xyz
444qrs
432opm
;
data want;
set have;
newvar=input(oldvar,fmt.);
run;
@sasgorilla wrote:
In this case with a few thousand values I am trying to paste from an excel file, what is the most efficient way to do this without having to put quotation marks around every single value? Thank you.
See this explanation of using the %QLIST macro for avoiding adding quotation marks. https://communities.sas.com/t5/SAS-Programming/Assign-the-values-to-a-macro-variable/m-p/955804#M373...
Or as @Tom said, put these values in a data set rather than code.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.