BookmarkSubscribeRSS Feed
Ani7
Obsidian | Level 7

I have a large dataset, 'have' (500 million rows), which I need to make some adjustments to before I run my analysis on it. The data set I have has an 'customer_id' variable that needs to be modified based on values from an external Excel file (has about 5000 rows). This Excel dataset currently looks like this:

 

customer_id1customer_id2combined
abccdeabc/cde
bbbbdcbbb/bdc

 

Essentially, using this table, I want to run the following code in a single data step to modify the dataset 'have' like this:

 

data want;
    set have;
    if customer_id in ("abc", "cde") then customer_id = "abc/cde";
    /*repeat this line 5000 times with all the values in the Excel file*/
run;

I was previously adding the entire "if-then" string as a variable in the Excel file and putting in to a macro variable using a select-into but unfortunately there's a length restriction on macro variables. I was wondering if there was a different way for me to print all 5000 if-then clauses in a single data step? I don't want to use a merge here.

4 REPLIES 4
Cynthia_sas
SAS Super FREQ

Hi:

  You might try a user-defined format. Here's a simple example. If you needed to you could make the user-defined format from your list in a program. This example doesn't have that, but it should give you an idea:

 data fakedata;
   infile datalines dlm=',';
   input ID $ name $ amt;
datalines;
abc,alan,100
cde,barb,200
efg,carl,300
fgh,edna,400
abg,fred,500
abc,gail,600
;
run;

proc format;
  value $change 'abc','cde'='abc/cde' 
                'efg' = 'efg/xxx';
run;

data new;
  length ID orig_ID$10;
  set fakedata;
  orig_id = id;
  ID = put(orig_id,$change.);
run;

proc print data=new;
  var ID name amt orig_ID;
run;

any values NOT in the user-defined format would retain their original values, as shown in the output.

 

Hope this helps,

Cynthia

Tom
Super User Tom
Super User

Looks like you just need a format.  With a format there is no need to actually change the data, just apply the format when you want to see the formatted values instead of the raw values.

 

To make a format from your data just convert it to a dataset in the format needed for a format definition using the CNTLIN= option of PROC FORMAT.

data have;
  length customer_id1 customer_id2 $20 combined $50 ;
  input customer_id1 customer_id2 combined;
cards;
abc cde abc/cde
bbb bdc bbb/bdc
;

data format;
  set have ;
  retain fmtname '$combined';
  start=customer_id1;
  label=combined;
  output;
  start=customer_id2;
  output;
  keep fmtname start label;
run;

proc format cntlin=format;
run;

proc print data=have;
  format customer_id: $combined.;
run;
       customer_    customer_
Obs       id1          id2       combined

 1      abc/cde      abc/cde     abc/cde
 2      bbb/bdc      bbb/bdc     bbb/bdc
ErikLund_Jensen
Rhodochrosite | Level 12

Hi @Ani7 

 

Why don't you want to use merge (or sql join, which amounts to the same)? - it is not only simpler and easier to write, it will run much faster. 500 million rows x 5000 if-thens sounds to me like a never-ending job 

 

With a SAS license you have already bought the world's best tools for this type of data manipulation, so why try to build your own out of odds and ends? - I am really curious.

DavePrinsloo
Pyrite | Level 9

Look into using hash tables. They are much faster than the other ways of doing this.   Google it!   
First read the excel sheet into a SAS table. If the example reflects the logic completely, you
would want to split each row into 2 lines

 

customer_id1 customer_id2 combined
abc cde abc/cde
bbb bdc bbb/bdc

 

becomes a sas table called  customer_lookup like

 

customer_id combined
abc abc/cde
cde abc/cde
bbb bbb/bdc
bdc bbb/bdc

 

 

/* this uses a macro that simplifies the code to define a hash table In your case, maybe good to look up the syntax! */

/* If interested, I can post the macro ut_hash_define */

data want;
    set have;

   %ut_hash_define(intable=customer_lookup,
     keys=customer_id,
     datacols=combined);

rc = customer_lookup.find();
run;

sas-innovate-white.png

Special offer for SAS Communities members

Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.

 

View the full agenda.

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 4 replies
  • 1607 views
  • 1 like
  • 5 in conversation