Greetings, first time poster, novice/intermediate user. I have a SAS data set consisting of the following variables:
NIIN PLANT SLOC replen2 COUNT NUM
NIIN is the part number, all distinct entries
PLANT is the source, all the same (2001)
SLOC is the responsible entity for that NIIN (271 distinct entries)
COUNT is the numeric count of individual SLOC in the file (i.e. SLOC ABCD has 76,890 entries, EFGH has 250, etc.)
NUM is the count from 1 to total observations in the file (1,922,802)
From this data, I need to create SAS data sets based on the following criteria:
Total NUM in the new datasets can't exceed 500,000 per data set
The SLOC entry can't be divided between the data sets (all ABCD must go into one data set...while not exceeding the 500,000 obs limit per set)
The issue is when I extract the first 500,000 obs, it falls right in the middle of one of the SLOCs.
How do I configure SAS to create datasets based on the criteria mentioned above?
The input file (NIIN PLANT SLOC replen2 COUNT NUM) will change each time I run the program.
Thanks in advance for any assistance.
So you only care that the SLOC entries are not split? Doesn't matter if the values of the other variables are scatter across multiple files?
First get the number of observations per value of SLOC. (If that is not already one of the variables you have).
proc freq data=HAVE;
tables sloc / noprint out=counts ;
run;
Now assign them into groups so that no group has more than 500,000.
data groups;
if eof then call symputx('ngroups',group);
set counts end=eof;
retain group 0 running_count 0;
if count + running_count > 500000 then do;
group+1;
running_count=0;
end;
running_count+count;
run;
Now you can write a data step that will spit.
So let's assume SLOC is a character variable.
filename code ;
data _null_;
file code ;
if _n_=1 then put "data group1-group&ngroups; set have;";
set groups end=eof;
by group;
if first.group then put 'if sloc in (' @;
put sloc :$quote. @;
if last.group then put ') then output group' group ';' ;
if eof then put 'run;';
run;
Now you can run that code and create the &Ngroups datasets.
%include code/ source2;
This is really hard to understand. Can you create a small example, let's say 50 observations, that illustrates what you want? Show us the example data as SAS data step code, which you can type in yourself of by following these instructions (and not via other methods); and then show us the desired output.
Can you describe why exactly you need to create different data sets? And possibly why the 500,000 record limit?
For use in SAS it is usually much easier to leave the data in one set and use either a BY statement to process groups of related records based on values of the variables or a WHERE statement to select specific records for a particular use.
This becomes even more of an issue if, as you say, the records change constantly, i.e. each time you run the program.
The data sets created from the original input file are to be exported to Excel and submitted into another gov system that is set up to receive only up to 500,000 records at a time (there are 1.9M observations in the input data) and the SLOC can't be split up between the new sets due to the requirements of that system. The renaming of the variables is also required by that system. Code is (so far):
DATA A; INFILE 'C:\Users\ra\Documents\My SAS Files\SAS 9.4 Data\text\pdt_output_trimmed.txt'
TRUNCOVER FIRSTOBS = 2;
input @1 NIIN $9.
@11 PLANT $4.
@16 SLOC $4.
@20 PDT $3.
;
RUN;
DATA B; SET A; if PLANT = '2001';
RENAME NIIN = niin PLANT = plant SLOC = sloc PDT = replen2;
PROC SORT DATA = B; BY sloc NIIN;
RUN;
DATA C; SET B; BY sloc NIIN;
IF FIRST.sloc THEN COUNT = 1; ELSE COUNT = COUNT +1; RETAIN COUNT;
NUM+1;
RUN;
The COUNT and NUM are my initial attempts to somehow identify the data for the subsets. The splits for the new datasets must occur between the different SLOCs and must not exceed the 500,000 obs limit. Problem is when I go to create the new data sets based on the number of observations, the splits occur in the middle of the SLOC, splitting that SLOC between two new SAS data sets. Sorry of this seems convoluted, difficult to explain. Just getting back into writing code after a few years away.
Example of the input data is as follows: (actual input .txt file has 1.9M obs) and as mentioned before, this initial input data will change each month.
NIIN PLANT SLOC replen2 COUNT NUM
013337632 2001 A011 25 1 1
013569098 2001 A011 25 2 2
014771660 2001 A011 25 3 3
3014935859 2001 A011 25 4 4
015004619 2001 A011 25 5 5
015062715 2001 A011 25 6 6
015064129 2001 A011 25 7 7
015064131 2001 A011 25 8 8
015064133 2001 A011 25 9 9
015066873 2001 A011 25 10 10
000000079 2001 A091 8 1 11
000045256 2001 A091 8 2 12
000050753 2001 A091 9 3 13
000103030 2001 A091 8 4 14
000115093 2001 A091 8 5 15
000146763 2001 A091 8 6 16
000179551 2001 A091 8 7 17
000182296 2001 A091 8 8 18
000187988 2001 A091 8 9 19
000187989 2001 A091 8 10 20
000599264 2001 A095 8 1 21
000680509 2001 A095 8 2 22
000686654 2001 A095 8 3 23
000712075 2001 A095 8 4 24
000712513 2001 A095 8 5 25
000712515 2001 A095 8 6 26
000811577 2001 A095 8 7 27
000813381 2001 A095 9 8 28
000847436 2001 A095 8 9 29
000881251 2001 A095 9 10 30
000888863 2001 A097 8 1 31
000889167 2001 A097 8 2 32
001000643 2001 A097 9 3 33
001003095 2001 A097 8 4 34
001005920 2001 A097 8 5 35
001005937 2001 A097 8 6 36
001045143 2001 A097 11 7 37
001116255 2001 A097 9 8 38
001217929 2001 A097 8 9 39
001275322 2001 A097 8 10 40
001278684 2001 A099 11 1 41
001304235 2001 A099 8 2 42
001319915 2001 A099 8 3 43
001325317 2001 A099 8 4 44
001349098 2001 A099 9 5 45
001423036 2001 A099 8 6 46
001433159 2001 A099 8 7 47
001440091 2001 A099 22 8 48
001503838 2001 A099 9 9 49
001513115 2001 A099 9 10 50
001528353 2001 A123 8 1 51
001648881 2001 A123 8 2 52
001668567 2001 A123 9 3 53
001670804 2001 A123 8 4 54
001691934 2001 A123 9 5 55
001691935 2001 A123 8 6 56
001720031 2001 A132 8 7 57
001727223 2001 A123 9 8 58
001731881 2001 A123 8 9 59
001744311 2001 A123 8 10 60
001805038 2001 B111 8 1 61
001827475 2001 B111 9 2 62
001856345 2001 B111 9 3 63
001879528 2001 B111 8 4 64
001962017 2001 B111 8 5 65
002000257 2001 B111 9 6 66
002023639 2001 B111 10 7 67
002024005 2001 B111 8 8 68
002026692 2001 B111 8 9 69
002043214 2001 B111 8 10 70
002051711 2001 B555 13 1 71
002052795 2001 B555 14 2 72
002212136 2001 B555 10 3 73
002246657 2001 B555 8 4 74
002256408 2001 B555 8 5 75
002264828 2001 B555 8 6 76
002266772 2001 B555 8 7 77
002348422 2001 B555 8 8 78
002403720 2001 B555 9 9 79
002416659 2001 B555 8 10 80
So, what is your plan, or the Gov system plans, when a single SLOC exceeds 500,000 records?
Note that a maximum number of records is just one of the reasons Excel is a poor choice for interchanging data. And is the requirement actual an XLSX file or CSV? There seems to be considerable confusion in some parts of the world about CSV and "Excel file".
@arbnmedic33 wrote:
The data sets created from the original input file are to be exported to Excel and submitted into another gov system that is set up to receive only up to 500,000 records at a time (there are 1.9M observations in the input data) and the SLOC can't be split up between the new sets due to the requirements of that system. The renaming of the variables is also required by that system.
So you only care that the SLOC entries are not split? Doesn't matter if the values of the other variables are scatter across multiple files?
First get the number of observations per value of SLOC. (If that is not already one of the variables you have).
proc freq data=HAVE;
tables sloc / noprint out=counts ;
run;
Now assign them into groups so that no group has more than 500,000.
data groups;
if eof then call symputx('ngroups',group);
set counts end=eof;
retain group 0 running_count 0;
if count + running_count > 500000 then do;
group+1;
running_count=0;
end;
running_count+count;
run;
Now you can write a data step that will spit.
So let's assume SLOC is a character variable.
filename code ;
data _null_;
file code ;
if _n_=1 then put "data group1-group&ngroups; set have;";
set groups end=eof;
by group;
if first.group then put 'if sloc in (' @;
put sloc :$quote. @;
if last.group then put ') then output group' group ';' ;
if eof then put 'run;';
run;
Now you can run that code and create the &Ngroups datasets.
%include code/ source2;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.