Solved: Splitting SAS dataset by multiple varying conditions

arbnmedic33 · Posted 05-23-2022 10:52 AM

Greetings, first time poster, novice/intermediate user. I have a SAS data set consisting of the following variables:

NIIN PLANT SLOC replen2 COUNT NUM

NIIN is the part number, all distinct entries

PLANT is the source, all the same (2001)

SLOC is the responsible entity for that NIIN (271 distinct entries)

COUNT is the numeric count of individual SLOC in the file (i.e. SLOC ABCD has 76,890 entries, EFGH has 250, etc.)

NUM is the count from 1 to total observations in the file (1,922,802)

From this data, I need to create SAS data sets based on the following criteria:

Total NUM in the new datasets can't exceed 500,000 per data set

The SLOC entry can't be divided between the data sets (all ABCD must go into one data set...while not exceeding the 500,000 obs limit per set)

The issue is when I extract the first 500,000 obs, it falls right in the middle of one of the SLOCs.

How do I configure SAS to create datasets based on the criteria mentioned above?

The input file (NIIN PLANT SLOC replen2 COUNT NUM) will change each time I run the program.

Thanks in advance for any assistance.

Tom · Posted 05-23-2022 12:00 PM

So you only care that the SLOC entries are not split? Doesn't matter if the values of the other variables are scatter across multiple files?

First get the number of observations per value of SLOC. (If that is not already one of the variables you have).

proc freq data=HAVE;
  tables sloc / noprint out=counts ;
run;

Now assign them into groups so that no group has more than 500,000.

data groups;
   if eof then call symputx('ngroups',group);
   set counts end=eof;
   retain group 0 running_count 0;
   if count + running_count > 500000 then do;
      group+1;
      running_count=0;
   end;
   running_count+count;
run;

Now you can write a data step that will spit.

So let's assume SLOC is a character variable.

filename code ;
data _null_;
   file code ;
   if _n_=1 then put "data group1-group&ngroups; set have;";
   set groups end=eof;
   by group;
   if first.group then put 'if sloc in (' @;
   put sloc :$quote. @;
   if last.group then put ') then output group' group ';' ;
   if eof then put 'run;';
run;

Now you can run that code and create the &Ngroups datasets.

%include code/ source2;

View solution in original post

PaigeMiller · Posted 05-23-2022 11:05 AM

This is really hard to understand. Can you create a small example, let's say 50 observations, that illustrates what you want? Show us the example data as SAS data step code, which you can type in yourself of by following these instructions (and not via other methods); and then show us the desired output.

--
Paige Miller

arbnmedic33 · Posted 05-23-2022 11:09 AM

Apologies... Yes, give me a minute and I will put something together.

ballardw · Posted 05-23-2022 11:08 AM

Can you describe why exactly you need to create different data sets? And possibly why the 500,000 record limit?

For use in SAS it is usually much easier to leave the data in one set and use either a BY statement to process groups of related records based on values of the variables or a WHERE statement to select specific records for a particular use.

This becomes even more of an issue if, as you say, the records change constantly, i.e. each time you run the program.

arbnmedic33 · Posted 05-23-2022 12:33 PM

The data sets created from the original input file are to be exported to Excel and submitted into another gov system that is set up to receive only up to 500,000 records at a time (there are 1.9M observations in the input data) and the SLOC can't be split up between the new sets due to the requirements of that system. The renaming of the variables is also required by that system. Code is (so far):

DATA A; INFILE 'C:\Users\ra\Documents\My SAS Files\SAS 9.4 Data\text\pdt_output_trimmed.txt' 
	TRUNCOVER FIRSTOBS = 2;
input	@1  NIIN $9.	
	@11 PLANT $4.	
	@16 SLOC $4.	
	@20 PDT $3.	
;
RUN;

DATA B; SET A; if PLANT = '2001';
	RENAME NIIN = niin PLANT = plant SLOC = sloc PDT = replen2;
	PROC SORT DATA = B; BY sloc NIIN;
RUN;

DATA C; SET B; BY sloc NIIN;
IF FIRST.sloc THEN COUNT = 1; ELSE COUNT = COUNT +1; RETAIN COUNT;
NUM+1;
RUN;

The COUNT and NUM are my initial attempts to somehow identify the data for the subsets. The splits for the new datasets must occur between the different SLOCs and must not exceed the 500,000 obs limit. Problem is when I go to create the new data sets based on the number of observations, the splits occur in the middle of the SLOC, splitting that SLOC between two new SAS data sets. Sorry of this seems convoluted, difficult to explain. Just getting back into writing code after a few years away.

Example of the input data is as follows: (actual input .txt file has 1.9M obs) and as mentioned before, this initial input data will change each month.

NIIN	 	PLANT	SLOC	replen2	     COUNT	NUM	
013337632	2001	A011	25	       1	1
013569098	2001	A011	25	       2	2
014771660	2001	A011	25	       3	3
3014935859	2001	A011	25	       4	4
015004619	2001	A011	25	       5	5
015062715	2001	A011	25	       6	6
015064129	2001	A011	25	       7	7
015064131	2001	A011	25	       8	8
015064133	2001	A011	25	       9	9
015066873	2001	A011	25	       10	10
000000079	2001	A091	8	       1	11
000045256	2001	A091	8	       2	12
000050753	2001	A091	9	       3	13
000103030	2001	A091	8	       4	14
000115093	2001	A091	8	       5	15
000146763	2001	A091	8	       6	16
000179551	2001	A091	8	       7	17
000182296	2001	A091	8	       8	18
000187988	2001	A091	8	       9	19
000187989	2001	A091	8	       10	20
000599264	2001	A095	8	       1	21
000680509	2001	A095	8	       2	22
000686654	2001	A095	8	       3	23
000712075	2001	A095	8	       4	24
000712513	2001	A095	8	       5	25
000712515	2001	A095	8	       6	26
000811577	2001	A095	8	       7	27
000813381	2001	A095	9	       8	28
000847436	2001	A095	8	       9	29
000881251	2001	A095	9	      10	30
000888863	2001	A097	8	       1	31
000889167	2001	A097	8	       2        32
001000643	2001	A097	9	       3	33
001003095	2001	A097	8	       4	34
001005920	2001	A097	8	       5	35
001005937	2001	A097	8	       6	36
001045143	2001	A097	11	       7	37
001116255	2001	A097	9	       8	38
001217929	2001	A097	8	       9	39
001275322	2001	A097	8	      10	40
001278684	2001	A099	11	       1	41
001304235	2001	A099	8	       2	42
001319915	2001	A099	8	       3	43
001325317	2001	A099	8	       4	44
001349098	2001	A099	9	       5	45
001423036	2001	A099	8	       6	46
001433159	2001	A099	8	       7	47
001440091	2001	A099	22	       8	48
001503838	2001	A099	9              9	49
001513115	2001	A099	9	      10	50
001528353	2001	A123	8	       1	51
001648881	2001	A123	8	       2	52
001668567	2001	A123	9	       3	53
001670804	2001	A123	8	       4	54
001691934	2001	A123	9	       5	55
001691935	2001	A123	8	       6	56
001720031	2001	A132	8	       7	57
001727223	2001	A123	9	       8	58
001731881	2001	A123	8	       9	59
001744311	2001	A123	8	      10	60
001805038	2001	B111	8	       1	61
001827475	2001	B111	9	       2	62
001856345	2001	B111	9	       3	63
001879528	2001	B111	8	       4	64
001962017	2001	B111	8	       5	65
002000257	2001	B111	9	       6	66
002023639	2001	B111	10	       7	67
002024005	2001	B111	8	       8	68
002026692	2001	B111	8	       9	69
002043214	2001	B111	8	      10	70
002051711	2001	B555	13	       1	71
002052795	2001	B555	14             2	72
002212136	2001	B555	10	       3	73
002246657	2001	B555	8	       4	74
002256408	2001	B555	8	       5	75
002264828	2001	B555	8	       6	76
002266772	2001	B555	8	       7	77
002348422	2001	B555	8	       8	78
002403720	2001	B555	9	       9	79
002416659	2001	B555	8	      10	80

ballardw · Posted 05-23-2022 01:23 PM

So, what is your plan, or the Gov system plans, when a single SLOC exceeds 500,000 records?

Note that a maximum number of records is just one of the reasons Excel is a poor choice for interchanging data. And is the requirement actual an XLSX file or CSV? There seems to be considerable confusion in some parts of the world about CSV and "Excel file".

@arbnmedic33 wrote:

The data sets created from the original input file are to be exported to Excel and submitted into another gov system that is set up to receive only up to 500,000 records at a time (there are 1.9M observations in the input data) and the SLOC can't be split up between the new sets due to the requirements of that system. The renaming of the variables is also required by that system.

arbnmedic33 · Posted 05-23-2022 02:02 PM

If the record count exceeds 500,000 for one SLOC, it would be submitted in two or more separate tasks. The receptive system requires the .xlsx file to be accepted as well. Thanks for the question...

Tom · Posted 05-23-2022 12:00 PM

So you only care that the SLOC entries are not split? Doesn't matter if the values of the other variables are scatter across multiple files?

First get the number of observations per value of SLOC. (If that is not already one of the variables you have).

proc freq data=HAVE;
  tables sloc / noprint out=counts ;
run;

Now assign them into groups so that no group has more than 500,000.

data groups;
   if eof then call symputx('ngroups',group);
   set counts end=eof;
   retain group 0 running_count 0;
   if count + running_count > 500000 then do;
      group+1;
      running_count=0;
   end;
   running_count+count;
run;

Now you can write a data step that will spit.

So let's assume SLOC is a character variable.

filename code ;
data _null_;
   file code ;
   if _n_=1 then put "data group1-group&ngroups; set have;";
   set groups end=eof;
   by group;
   if first.group then put 'if sloc in (' @;
   put sloc :$quote. @;
   if last.group then put ') then output group' group ';' ;
   if eof then put 'run;';
run;

Now you can run that code and create the &Ngroups datasets.

%include code/ source2;

arbnmedic33 · Posted 05-23-2022 02:46 PM

Thanks Tom, appreciate the assistance...

Splitting SAS dataset by multiple varying conditions

Re: Splitting SAS dataset by multiple varying conditions

Re: Splitting SAS dataset by multiple varying conditions

Re: Splitting SAS dataset by multiple varying conditions

Re: Splitting SAS dataset by multiple varying conditions

Re: Splitting SAS dataset by multiple varying conditions

Re: Splitting SAS dataset by multiple varying conditions

Re: Splitting SAS dataset by multiple varying conditions

Re: Splitting SAS dataset by multiple varying conditions

Re: Splitting SAS dataset by multiple varying conditions

Classroom Training Available!