I have a large dataset, 'have' (500 million rows), which I need to make some adjustments to before I run my analysis on it. The data set I have has an 'customer_id' variable that needs to be modified based on values from an external Excel file (has about 5000 rows). This Excel dataset currently looks like this:
customer_id1 | customer_id2 | combined |
abc | cde | abc/cde |
bbb | bdc | bbb/bdc |
Essentially, using this table, I want to run the following code in a single data step to modify the dataset 'have' like this:
data want;
set have;
if customer_id in ("abc", "cde") then customer_id = "abc/cde";
/*repeat this line 5000 times with all the values in the Excel file*/
run;
I was previously adding the entire "if-then" string as a variable in the Excel file and putting in to a macro variable using a select-into but unfortunately there's a length restriction on macro variables. I was wondering if there was a different way for me to print all 5000 if-then clauses in a single data step? I don't want to use a merge here.
Hi:
You might try a user-defined format. Here's a simple example. If you needed to you could make the user-defined format from your list in a program. This example doesn't have that, but it should give you an idea:
data fakedata;
infile datalines dlm=',';
input ID $ name $ amt;
datalines;
abc,alan,100
cde,barb,200
efg,carl,300
fgh,edna,400
abg,fred,500
abc,gail,600
;
run;
proc format;
value $change 'abc','cde'='abc/cde'
'efg' = 'efg/xxx';
run;
data new;
length ID orig_ID$10;
set fakedata;
orig_id = id;
ID = put(orig_id,$change.);
run;
proc print data=new;
var ID name amt orig_ID;
run;
any values NOT in the user-defined format would retain their original values, as shown in the output.
Hope this helps,
Cynthia
Looks like you just need a format. With a format there is no need to actually change the data, just apply the format when you want to see the formatted values instead of the raw values.
To make a format from your data just convert it to a dataset in the format needed for a format definition using the CNTLIN= option of PROC FORMAT.
data have;
length customer_id1 customer_id2 $20 combined $50 ;
input customer_id1 customer_id2 combined;
cards;
abc cde abc/cde
bbb bdc bbb/bdc
;
data format;
set have ;
retain fmtname '$combined';
start=customer_id1;
label=combined;
output;
start=customer_id2;
output;
keep fmtname start label;
run;
proc format cntlin=format;
run;
proc print data=have;
format customer_id: $combined.;
run;
customer_ customer_ Obs id1 id2 combined 1 abc/cde abc/cde abc/cde 2 bbb/bdc bbb/bdc bbb/bdc
Hi @Ani7
Why don't you want to use merge (or sql join, which amounts to the same)? - it is not only simpler and easier to write, it will run much faster. 500 million rows x 5000 if-thens sounds to me like a never-ending job
With a SAS license you have already bought the world's best tools for this type of data manipulation, so why try to build your own out of odds and ends? - I am really curious.
Look into using hash tables. They are much faster than the other ways of doing this. Google it!
First read the excel sheet into a SAS table. If the example reflects the logic completely, you
would want to split each row into 2 lines
customer_id1 | customer_id2 | combined |
abc | cde | abc/cde |
bbb | bdc | bbb/bdc |
becomes a sas table called customer_lookup like
customer_id | combined |
abc | abc/cde |
cde | abc/cde |
bbb | bbb/bdc |
bdc | bbb/bdc |
/* this uses a macro that simplifies the code to define a hash table In your case, maybe good to look up the syntax! */
/* If interested, I can post the macro ut_hash_define
*/
data want;
set have;
%ut_hash_define(intable=customer_lookup,
keys=customer_id,
datacols=combined);
rc = customer_lookup.find();
run;
Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.