BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
JackoNewbie
Calcite | Level 5

I have been stuck on a piece of code for a long time and would appreciate any help I can get? Using SAS enterprise guide. I would like to learn how to split a dataset I have by number of observations for a specific variable. The variable is really long and unique to a product. As the variable is so long I can not simply use a sub set to reference the variable in the next piece of code or use a proc sql to count the variable in to a macro variable. Both would either reach its limit or take far too long to run. The main issue I have is I want this all to be automatically counted and calculated by the sas code(avoid having to manually change the code to create another dataset for new variables or have to state how many smaller datasets it has split the data in to), there are already 29000 of this variable and it will only get bigger, the variable is seriously long (20-30 characters) and I would like the split down datasets to be able to use in macro variable references in the next two pieces of code.

 

EG. I have a dataset with 6000 different product numbers. I want sas to split this variable down by 1500 into smaller datasets. So this would mean it would put the first 1500 into dataset 1, the next 1500 into dataset 2 etc. If the following week another 1500 is added to the dataset, the code would pick this up and split the final 1500 into dataset 5. The smaller datasets would then be used in macro variable for other pieces of code I have. So dataset 1 could be referenced &PN1, dataset 2 &PN2 etc. It would ideally be able to tell how many smaller datasets there are and reference the smaller datasets in a macro variable without having myself manually entering each smaller dataset.

 

example of what I want to achieve. This example does not take in to the account the actual product number I have is really long and there are over 29000 observations.

Have:

data work.class;

input ProductNumber:32.;

1234

1345

1347

1889

1009

2234

;;;;

 

data work.class;

input ProductNumber:32.;

1234

1345

1347

1889

1009

2234

;;;;

data work.class;

input ProductNumber:32.;

1234

1345

1347

1889

1009

2234

;;;;

 

data work.class;

input ProductNumber:32.;

1234

1345

1347

1889

1009

2234

 ;;;;

 

Want

The above split by 2 observations and be put into macrovariables to be used.

1234 and 1345 into &PM1

1347 and 1889 into &PM2

1009 and 2234 into &PM3

If the following week there are two more added then it will automatically pick this up and put them into &PM4. 

 

If the next piece of code was :

proc sql;

select * 

from madeupset

where product number in ( &PM1, &PM2, &PM3, &PM4)

;quit;

 Would there be a way of referencing those smaller datasets without having to manually enter them when they are created?

 

Apologies if this is confusing. It is my first post and I will be very happy to provide anything else needed.

 

Thanks!

JackoNewbie

 

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

You probably need to find another way to do your query since most systems are also going to start complaining if you give them really long lists of constant values.  What system are you dataset in? Are they SAS datasets? Are they in a database like Oracle?

 

Note that there is no need to make datasets with the "split" lists if the goal is to make MACRO variables with the split list.

For example you could just use dataset options to limit which values go into which macro variable.

select quote(strip(product_number,"'") 
  into :PM0 seperated by ','
  from PMLIST(firstobs=0001 obs=1000)
;
select quote(strip(product_number,"'") 
  into :PM1 seperated by ','
  from PMLIST(firstobs=1001 obs=2000)
;
...

Or use CALL SYMPUTX() to make the macro variables in one pass of the list.

So this datastep will split list of product numbers from a dataset named PMLIST into a series of macro variables named PM1 , PM2, ... and also generate a macro variable named PMLIST that references all of the other generated macro variables.

data _null_;
  length pmlist $32000 mvarlist $32000;
  retain mvarlist;
  mvar+1;
  do until(length(pmlist)>31000);
    set pmlist end=eof;
    pmlist=catx(',',pmlist,quote(trim(product_number),"'"));
  end;
  call symputx(cats('PM',mvar),pmlist);
  mvarlist=catx(',',mvarlist,cats('&PM',mvar));
  if eof then call symputx('pmlist',mvarlist);
run;

Then your query is just:

where product_number in (&PMLIST)

View solution in original post

15 REPLIES 15
Reeza
Super User

Are you doing this for a SQL Pass through query? If so there are better ways, a macro for example that automatically creates that list on the fly so you don't run into any issues with macro variable size. Or using a subquery depending if the data is in the same warehouse.

where product_number in (Select ID from pm)

 

 

 

Spoiler

@JackoNewbie wrote:

I have been stuck on a piece of code for a long time and would appreciate any help I can get? Using SAS enterprise guide. I would like to learn how to split a dataset I have by number of observations for a specific variable. The variable is really long and unique to a product. As the variable is so long I can not simply use a sub set to reference the variable in the next piece of code or use a proc sql to count the variable in to a macro variable. Both would either reach its limit or take far too long to run. The main issue I have is I want this all to be automatically counted and calculated by the sas code(avoid having to manually change the code to create another dataset for new variables or have to state how many smaller datasets it has split the data in to), there are already 29000 of this variable and it will only get bigger, the variable is seriously long (20-30 characters) and I would like the split down datasets to be able to use in macro variable references in the next two pieces of code.

 

EG. I have a dataset with 6000 different product numbers. I want sas to split this variable down by 1500 into smaller datasets. So this would mean it would put the first 1500 into dataset 1, the next 1500 into dataset 2 etc. If the following week another 1500 is added to the dataset, the code would pick this up and split the final 1500 into dataset 5. The smaller datasets would then be used in macro variable for other pieces of code I have. So dataset 1 could be referenced &PN1, dataset 2 &PN2 etc. It would ideally be able to tell how many smaller datasets there are and reference the smaller datasets in a macro variable without having myself manually entering each smaller dataset.

 

example of what I want to achieve. This example does not take in to the account the actual product number I have is really long and there are over 29000 observations.

Have:

data work.class;

input ProductNumber:32.;

1234

1345

1347

1889

1009

2234

;;;;

 

data work.class;

input ProductNumber:32.;

1234

1345

1347

1889

1009

2234

;;;;

data work.class;

input ProductNumber:32.;

1234

1345

1347

1889

1009

2234

;;;;

 

data work.class;

input ProductNumber:32.;

1234

1345

1347

1889

1009

2234

 ;;;;

 

Want

The above split by 2 observations and be put into macrovariables to be used.

1234 and 1345 into &PM1

1347 and 1889 into &PM2

1009 and 2234 into &PM3

If the following week there are two more added then it will automatically pick this up and put them into &PM4. 

 

If the next piece of code was :

proc sql;

select * 

from madeupset

where product number in ( &PM1, &PM2, &PM3, &PM4)

;quit;

 Would there be a way of referencing those smaller datasets without having to manually enter them when they are created?

 

Apologies if this is confusing. It is my first post and I will be very happy to provide anything else needed.

 

Thanks!

JackoNewbie

 


 

 

JackoNewbie
Calcite | Level 5

Hi Reeza,

 

Thanks for the super quick reply.

 

Yes it is and I have tried the subquery way but it takes far too long which is why I would like the sets broken down by number. When I do this manually and reference the smaller sets broken down it runs super super quick. I am just trying to find a way for this to be done without having to manually edit the code. I do definitely need a macro of some sort.

JackoNewbie
Calcite | Level 5

additional info - 

 

You are correct in my code I am combining the datasets by putting 'where Product_number in (&PM1, &PM2 etc)'. I would like to avoid this manual step of entering the amount of datasets I need to include and possibly use a macro to automatically calculate this itself and use in code. 

 

The reason I am doing this is to avoid the lengthy time it takes to complete this by simply doing a sub query. It could reduce the time it takes from an hour to just 5 minutes for example.

 

I was meaning that the example product 1234 is actually a longer value. Such as 12345678910111213141516171819202122 and there are around 30000 of these value which I need to reference. So this is why I am splitting the dataset into smaller ones as It reaches a limit when I use in a macro variable. I couldn't just use one macro variable list to referent all products as it reaches a limit (can't fit it all in &PM1). 

 

If I has a dataset with a list of 30,000 different product_numbers which are the same length as 12345678910111213141516171819202122. I would use the below code to split them down and use in a different dataset but this is manual and would like it to be automated.

 

data PM1 PM2 PM3;
if _n_ le 10000 then output PM1;
else if _n_ le 20000 then output PM2;
else if _n_ le 30000 then output PM3;
run;



proc sql noprint;
select quote(strip(product_number,"'") into:PM1 seperated by ', '
from PM1;
quit;
run;
%put PM1 = &PM1

proc sql noprint;
select quote(strip(product_number,"'") into:PM2 seperated by ', '
from PM2;
quit;
run;
%put PM2 = &PM2

proc sql noprint;
select quote(strip(product_number,"'") into:PM3 seperated by ', '
from PM3;
quit;
run;
%put PM3 = &PM3

/*the above would not let me split by 10000 in my real code as splitting by 10000 would exceed the maximum length (65534)

proc sql;
create table getextrainfo as
select *
from tableheldwithextrainfo
where product_number in (&PM1, &PM2, &PM3)
;quit;

I hope this helps. Again I am happy to provide more information.

 

ballardw
Super User

I can't see anything in your description that actually requires a splitting up of the data, much less to continue to add complexity by creating ever more data sets.

Without any rule related to the actual VALUES of the variable going to specific sets then to get all records for a single "product" the first thing you would be doing constantly is combining all of the data sets.

 

Generally there are two main cases when I consider "splitting data". The first is when I am forced to provide an EXCEL or similar spreadsheet and my data contains more variables or records than can be placed into a single sheet of a spreadsheet.

The other is when the actual structure/ layout or coding values of a data description changes.

 

One tends to suspect there is much left out of your problem discussion and much of it might be addressed if a more complete description is provided.

 

By "The variable is really long and unique to a product." do you mean that a single product identifier has multiple long associated values? Or is there just one long value. An example of the later is product 1234 has "long value" such as "Widget X for purpose Y" and only that long value. If this is the case then your PRODUCT code already contains all of the information you ever need. You could use the information to create a Format for displaying the "long" value instead of the raw code for most purposes.

 

I am not sure why you say "the variable is seriously long (20-30 characters)". Since the length of a single SAS character variable can go to 32,767 characters that is not particularly long.

JackoNewbie
Calcite | Level 5

Thanks Ballard for the reply and apologies again as this is my first post.

 

You are correct in my code I am combining the datasets by putting 'where Product_number in (&PM1, &PM2 etc)'. I would like to avoid this manual step of entering the amount of datasets I need to include and possibly use a macro to automatically calculate this itself and use in code. 

 

The reason I am doing this is to avoid the lengthy time it takes to complete this by simply doing a sub query. It could reduce the time it takes from an hour to just 5 minutes for example.

 

I was meaning that the example product 1234 is actually a longer value. Such as 12345678910111213141516171819202122 and there are around 30000 of these value which I need to reference. So this is why I am splitting the dataset into smaller ones as It reaches a limit when I use in a macro variable. I couldn't just use one macro variable list to referent all products as it reaches a limit (can't fit it all in &PM1). 

 

If I has a dataset with a list of 30,000 different product_numbers which are the same length as 12345678910111213141516171819202122. I would use the below code to split them down and use in a different dataset but this is manual and would like it to be automated.

 

data PM1 PM2 PM3;
if _n_ le 10000 then output PM1;
else if _n_ le 20000 then output PM2;
else if _n_ le 30000 then output PM3;
run;



proc sql noprint;
select quote(strip(product_number,"'") into:PM1 seperated by ', '
from PM1;
quit;
run;
%put PM1 = &PM1

proc sql noprint;
select quote(strip(product_number,"'") into:PM2 seperated by ', '
from PM2;
quit;
run;
%put PM2 = &PM2

proc sql noprint;
select quote(strip(product_number,"'") into:PM3 seperated by ', '
from PM3;
quit;
run;
%put PM3 = &PM3

/*the above would not let me split by 10000 in my real code as splitting by 10000 would exceed the maximum length (65534)

proc sql;
create table getextrainfo as
select *
from tableheldwithextrainfo
where product_number in (&PM1, &PM2, &PM3)
;quit;

I hope this helps. Again I am happy to provide more information.

 

Thanks

Tom
Super User Tom
Super User

You probably need to find another way to do your query since most systems are also going to start complaining if you give them really long lists of constant values.  What system are you dataset in? Are they SAS datasets? Are they in a database like Oracle?

 

Note that there is no need to make datasets with the "split" lists if the goal is to make MACRO variables with the split list.

For example you could just use dataset options to limit which values go into which macro variable.

select quote(strip(product_number,"'") 
  into :PM0 seperated by ','
  from PMLIST(firstobs=0001 obs=1000)
;
select quote(strip(product_number,"'") 
  into :PM1 seperated by ','
  from PMLIST(firstobs=1001 obs=2000)
;
...

Or use CALL SYMPUTX() to make the macro variables in one pass of the list.

So this datastep will split list of product numbers from a dataset named PMLIST into a series of macro variables named PM1 , PM2, ... and also generate a macro variable named PMLIST that references all of the other generated macro variables.

data _null_;
  length pmlist $32000 mvarlist $32000;
  retain mvarlist;
  mvar+1;
  do until(length(pmlist)>31000);
    set pmlist end=eof;
    pmlist=catx(',',pmlist,quote(trim(product_number),"'"));
  end;
  call symputx(cats('PM',mvar),pmlist);
  mvarlist=catx(',',mvarlist,cats('&PM',mvar));
  if eof then call symputx('pmlist',mvarlist);
run;

Then your query is just:

where product_number in (&PMLIST)
JackoNewbie
Calcite | Level 5

Hi Tom,

 

This is super useful to me!

 

The set is in SAS. SAS EG. 

 

I tried this code and it is showing an error when I am trying to run the final query with: where product_number in (&PMLIST)

 

The error is that pmlist is not being referenced properly. 'Warning: Apparent symbolic reference PMLIST not resolved.' From the code you sent, the only section I changed was where the dataset was stored. So in the 'SET' section I changed the dataset 'pmlist' to my own dataset.

 

When I refeence &PM1 &PM2 etc it works. Just when it is &PMLIST it does not get referenced properly.

'where product_number in (&PM1)'

 

Thanks!

Tom
Super User Tom
Super User

Hard to tell without seeing your code. Most likely you accidentally removed the END= option from the SET statement so the CALL SYMPUTX() function that creates that macro variable never runs.

If not then change the condition for the DO loop to also test the EOF variable.

until(eof or length(pmlist)>31000)
Reeza
Super User
Run just the data _null_ step and then check if the macro variable is being created.

So the data _null_ step above and then:

%PUT &pmlist.
JackoNewbie
Calcite | Level 5

Hi Reeza,

 

I need to add: 

until(eof or length(pmlist)>31000)

 

Thanks for your replies and help  

JackoNewbie
Calcite | Level 5

Fantastic Tom! All I needed to do was add :

 

until(eof or length(pmlist)>31000) 

 

Thanks for all your help with this 😄 

 

Jacko

 

 

JackoNewbie
Calcite | Level 5

Hi @Tom ,

 

When you said 'You probably need to find another way to do your query since most systems are also going to start complaining if you give them really long lists of constant values.' Did you mean the piece of code you showed me could fail at some point? Would it ever reach a 'limit'?

 

Thanks

Jacko

novinosrin
Tourmaline | Level 20

Hi and welcome to SAS. Can you make your question simpler by giving us the following plz

 

1. A sample data of what you HAVE's (paste here as text for us to copy paste and work with it)

2. A sample data of what you WANT's(your expected output for the input sample)

3. A brief explanation of the convert/business logic and that is your objective

 

I am sure somebody will offer you the coding solution. If i have time, I'll give it a shot!

JackoNewbie
Calcite | Level 5

Hi Novinosrin,

 

Sorry for the lack of information. Will the above response to another member be enough info?

 

 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 15 replies
  • 11880 views
  • 4 likes
  • 5 in conversation