BookmarkSubscribeRSS Feed
Ani7
Obsidian | Level 7

I have a large dataset, 'have' (500 million rows), which I need to make some adjustments to before I run my analysis on it. The data set I have has an 'customer_id' variable that needs to be modified based on values from an external Excel file (has about 5000 rows). This Excel dataset currently looks like this:

 

customer_id1customer_id2combined
abccdeabc/cde
bbbbdcbbb/bdc

 

Essentially, using this table, I want to run the following code in a single data step to modify the dataset 'have' like this:

 

data want;
    set have;
    if customer_id in ("abc", "cde") then customer_id = "abc/cde";
    /*repeat this line 5000 times with all the values in the Excel file*/
run;

I was previously adding the entire "if-then" string as a variable in the Excel file and putting in to a macro variable using a select-into but unfortunately there's a length restriction on macro variables. I was wondering if there was a different way for me to print all 5000 if-then clauses in a single data step? I don't want to use a merge here.

4 REPLIES 4
Cynthia_sas
SAS Super FREQ

Hi:

  You might try a user-defined format. Here's a simple example. If you needed to you could make the user-defined format from your list in a program. This example doesn't have that, but it should give you an idea:

 data fakedata;
   infile datalines dlm=',';
   input ID $ name $ amt;
datalines;
abc,alan,100
cde,barb,200
efg,carl,300
fgh,edna,400
abg,fred,500
abc,gail,600
;
run;

proc format;
  value $change 'abc','cde'='abc/cde' 
                'efg' = 'efg/xxx';
run;

data new;
  length ID orig_ID$10;
  set fakedata;
  orig_id = id;
  ID = put(orig_id,$change.);
run;

proc print data=new;
  var ID name amt orig_ID;
run;

any values NOT in the user-defined format would retain their original values, as shown in the output.

 

Hope this helps,

Cynthia

Tom
Super User Tom
Super User

Looks like you just need a format.  With a format there is no need to actually change the data, just apply the format when you want to see the formatted values instead of the raw values.

 

To make a format from your data just convert it to a dataset in the format needed for a format definition using the CNTLIN= option of PROC FORMAT.

data have;
  length customer_id1 customer_id2 $20 combined $50 ;
  input customer_id1 customer_id2 combined;
cards;
abc cde abc/cde
bbb bdc bbb/bdc
;

data format;
  set have ;
  retain fmtname '$combined';
  start=customer_id1;
  label=combined;
  output;
  start=customer_id2;
  output;
  keep fmtname start label;
run;

proc format cntlin=format;
run;

proc print data=have;
  format customer_id: $combined.;
run;
       customer_    customer_
Obs       id1          id2       combined

 1      abc/cde      abc/cde     abc/cde
 2      bbb/bdc      bbb/bdc     bbb/bdc
ErikLund_Jensen
Rhodochrosite | Level 12

Hi @Ani7 

 

Why don't you want to use merge (or sql join, which amounts to the same)? - it is not only simpler and easier to write, it will run much faster. 500 million rows x 5000 if-thens sounds to me like a never-ending job 

 

With a SAS license you have already bought the world's best tools for this type of data manipulation, so why try to build your own out of odds and ends? - I am really curious.

DavePrinsloo
Pyrite | Level 9

Look into using hash tables. They are much faster than the other ways of doing this.   Google it!   
First read the excel sheet into a SAS table. If the example reflects the logic completely, you
would want to split each row into 2 lines

 

customer_id1 customer_id2 combined
abc cde abc/cde
bbb bdc bbb/bdc

 

becomes a sas table called  customer_lookup like

 

customer_id combined
abc abc/cde
cde abc/cde
bbb bbb/bdc
bdc bbb/bdc

 

 

/* this uses a macro that simplifies the code to define a hash table In your case, maybe good to look up the syntax! */

/* If interested, I can post the macro ut_hash_define */

data want;
    set have;

   %ut_hash_define(intable=customer_lookup,
     keys=customer_id,
     datacols=combined);

rc = customer_lookup.find();
run;

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 910 views
  • 1 like
  • 5 in conversation