BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Dsrountree
Obsidian | Level 7

What is the easiest way to parse a data set and create multiple data sets based on a given value?

 

Example: 

Let’s use telephone numbers and look for all values found in PHONENUMBER beginning with

631 and put them in data set A

516 and put them in data set B

212 and put them in data set C

 

Data:

631 555 5555

631 666 6666

516 999 9999

516 888 8888

212 444 4444

212 444 5555

 

Problem: 

You don’t know the first 3 values – If I did – I would write a substring/do loop to read the first 3 of the field “PHONENUMBER” and tell it to read through the table and when it finds a matching value “substr (phonenumber,1,3) write it out to the associated table.

 

What function would you use to search the field PHONENUMBER and look for all like values from a specific position and place those values in a separate data set?

 

This is without defining the values to look for since the list of area codes is quite large.

 

All help is greatly appreciated – I haven’t been in the SAS world for over 2 years so I’m a little rusty – Thanks in advance

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Oligolas
Barite | Level 11

 

Hi,

 

I won't use hash for this, since the assignment is not known. This is not a classical look up situation.

Instead I would just program where to output the rows and use a simple call execute to perform the job:

 

data test;
infile datalines ;
input phonenumber $1-20 @;
datalines;
631 555 5555
631 666 6666
516 999 9999
516 888 8888
212 444 4444
212 444 5555
;
run;

data test1;
   set test;
   first3=substr(phonenumber,1,3);
run;
proc sort data=test1;
   by first3 phonenumber;
run;

data test2;
   set test1;
   length ds $10;
   by first3;
   retain count .;
   if first.first3 then count+1;
   if count>26 then ds=repeat('a',int(count/26));
   ds=strip(ds)||byte(96+count);  
run;

data _NULL_;
   set test2;
   by first3;
   if first.first3;
   call execute('data '||strip(ds)||'(drop=first3 count ds); set test2; where ds eq "'||strip(ds)||'"; run;');
run;


 

 

________________________

- Cheers -

View solution in original post

8 REPLIES 8
Astounding
PROC Star

The most elegant way would be to use hashing.  Since I'm out of my element there, here's a link to a similar question ... refer to the final suggested solution.

 

https://communities.sas.com/t5/Base-SAS-Programming/create-a-new-Dataset-for-every-unique-Country-an...

 

Your situation may be mildly more complex, since the area code itself would not be a valid name for a SAS data set.

 

It could be done without hashing, using macro language to extract all the area codes, place a set of names on the DATA statement, and construct a series of IF/THEN statements to control outputting. 

Dsrountree
Obsidian | Level 7

If I was only looking at a distinct value like zipcode then I could use code like that.

The problem is I need to evalute the value based on a given position and separate accordingly.

 

Any suggestions...

Astounding
PROC Star

You'll have to cross that bridge, no matter what the approach.  It might be easier to plan:

 

if the phone number begins with 631, put it into data set named _631

if the phone number begins with 516, put it into data set named _516

 

So if using hashing, you might have to construct the key to the hash table, concatenating an underscore and the area code (i think again out of my element here).  But if you were using macro language, you would need to do something similar to get the names of the data sets, for example:

 

proc sql noprint;

select distinct '_' || substr(phone, 1, 3) into : data_set_list separated by ' ' from have;

quit;

 

Then the macro language approach would use:

 

data &data_set_list;

 

 

Oligolas
Barite | Level 11

 

Hi,

 

I won't use hash for this, since the assignment is not known. This is not a classical look up situation.

Instead I would just program where to output the rows and use a simple call execute to perform the job:

 

data test;
infile datalines ;
input phonenumber $1-20 @;
datalines;
631 555 5555
631 666 6666
516 999 9999
516 888 8888
212 444 4444
212 444 5555
;
run;

data test1;
   set test;
   first3=substr(phonenumber,1,3);
run;
proc sort data=test1;
   by first3 phonenumber;
run;

data test2;
   set test1;
   length ds $10;
   by first3;
   retain count .;
   if first.first3 then count+1;
   if count>26 then ds=repeat('a',int(count/26));
   ds=strip(ds)||byte(96+count);  
run;

data _NULL_;
   set test2;
   by first3;
   if first.first3;
   call execute('data '||strip(ds)||'(drop=first3 count ds); set test2; where ds eq "'||strip(ds)||'"; run;');
run;


 

 

________________________

- Cheers -

Dsrountree
Obsidian | Level 7

Nicely done....

But quick question on data _NULL_;

 

If I have several other columns,

IE: First Name, Last Name, Address.

 

Will that information also carry over to table A, B, C?

Oligolas
Barite | Level 11

sure, it's just droping the unwanted variables, the other variables remain where they are.

________________________

- Cheers -

Astounding
PROC Star

Looks like nobody is stepping up with a hashing solution.  Here's how I would approach a CALL EXECUTE solution:

 

proc freq data=have;

   tables phone_number / noprint out=counts (keep=phone_number);

   format phone_number $3.;

run;

 

data _null_;

   length phone_number $ 3;

   call execute ('data ');

   do until (done1);

      set counts end=done1;

      call execute('_' || phone_number);

   end;

   call execute('; set have; select (substr(phone_number, 1, 3));');

   do until (done2);

      set counts end=done2;

      call execute('when ("' || phone_number || '") output _' || phone_number || ';');

   end;

   call execute('end; run;');

   stop;

run;

 

It's untested, but looks like a viable approach.

Quentin
Super User

I like the hash approach for dynamic splitting, see e.g. Dorfman et al

http://www.lexjansen.com/nesug/nesug09/hw/HW04.pdf.

 

Below uses a single hash table and a DoW loop, so data must be sorted by the key. The paper also describes a Hash-of-Hashes approach if the data can't be sorted.

 

Process below is:

  1. Assign a key (first three digits of phone number or whatever you want to split by).
  2. Sort by the key.
  3. Use a DoW loop to read all data for a by-group, loading it into a hash table.
  4. Output the hash table to a dataset.
  5. Clear the hash table.
  6. Goto #3

 

data have;
  infile datalines ;
  input phonenumber $1-20 ;

  *define a key;
  mykey=substr(phonenumber,1,3);

datalines;
631 555 5555
631 666 6666
516 999 9999
516 888 8888
212 444 4444
212 444 5555
111 111 1111
999 999 9999
;
run;

proc sort data=have;
  by mykey;
run;

data _null_ ;
  if 0 then set have;
  if _n_=1 then do;
    declare hash h();
    h.definekey ("_n_") ; 
    h.definedata ("phonenumber" ) ;
    h.definedone () ;
  end;
  do _n_ = 1 by 1 until ( last.mykey ) ;
    set have ;
    by mykey ;
    h.add() ;
  end ;
  h.output (dataset: cats("OUT_",mykey) ) ;
  h.clear() ;
run ;
BASUG is hosting free webinars Next up: Jane Eslinger presenting PROC REPORT and the ODS EXCEL destination on Mar 27 at noon ET. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 1052 views
  • 3 likes
  • 4 in conversation