BookmarkSubscribeRSS Feed
deleted_user
Not applicable
My data set,OldTable, has 6 numeric variables ID (identity number), YEAR, N1-N4.

An identity(ID) can have several records each YEAR.

For each ID and each YEAR I want to keep only one record in a new data set, NewTable.

If for a specific ID and YEAR the value of N1 equals 1 for any of the records, then the record, with the highest value for N2 should be kept, among those with N1=1.

If N1=0 for all of an ID:s records in a YEAR, then the record with the highest value for N3 should be kept.

Ex.
OldTable
ID_YEAR_ N1_ N2 _N3_ N4
11_2008__0__40__30__90
11_2008__1__30__35__50
11_2008__1__20__25__60
11_2008__0__30__30__70
11_2009__0__30__50__95
11_2009__0__10__70__92
11_2009__0__20__40__99

should give

NewTable
ID_YEAR_ N1_ N2 _N3_ N4
11_2008__1__30__35__50
11_2009__0__10__70__92

My data set contains more than 2 million records. I guess that one (or maybe a combination of) data step(s) would be faster than PROC SQL.

I would be most greatful if you could help me.
8 REPLIES 8
sbb
Lapis Lazuli | Level 10 sbb
Lapis Lazuli | Level 10
I'd suggest a combination of DATA step to filter your input, and combine in a PROC SORT along with another DATA step process with MERGE and BY to re-introduce your max-value condition mentioned.

Suggest the OP break down the requirements into individual sub-steps, to process the input file (maybe starting with a suitable sample data represented as DATALINES rather than using OBS=nnn to short-circuit input data handling). Then, after suitable time/effort to compose a SAS application program to address the object, then come back to the forum for feedback/guidance -- it may help with learning the SAS language as well as allowing future support opportunities with the resulting coded program.

Scott Barry
SBBWorks, Inc.
deleted_user
Not applicable
I already have a solution: Make two data sets, one where there are only N1=0 for each ID and YEAR. Then sort by ID, YEAR and N3 and keep the record where LAST.N3=1.

In the other data set I would have each ID and YEAR where there were records with N1=0 and N1=1. Then sort by ID, YEAR, N1 and N2 and keep the record where LAST.N2=1.

Then put the two data sets together.

If it was possible to use only one PROC SQL to get what I want, then I could accept a slower execution.

Could you say immediately that it's not possible to do everything in one PROC SQL?
sbb
Lapis Lazuli | Level 10 sbb
Lapis Lazuli | Level 10
Okay - wasn't clear from the submitted post. Glad to hear you had success.

And, to your question: No, not without investing time/effort to replicate what you have accomplished in/with a DATA and PROC SORT step processing approach, if at all possible.

Scott Barry
SBBWorks, Inc.
data_null__
Jade | Level 19
This method reads the data completely 2 times in one data step.

[pre]
data have;
infile cards dlm='_';
input ID $ YEAR N1 N2 N3 N4;
cards;
11_2008__0__40__30__90
11_2008__1__30__35__50
11_2008__1__20__25__60
11_2008__0__30__30__70
11_2009__0__30__50__95
11_2009__0__10__70__92
11_2009__0__20__40__99
;;;;
run;
proc print;
run;
data new;
do _n_ = 1 by 1 until(last.year);
set have;
by id year;
mn1 = max(mn1,n1);
if n1 eq 1 then mn2 = max(mn2,n2);
mn3 = max(mn3,n3);
end;
do _n_ = 1 to _n_;
set have;
if mn1 eq 1 then do;
if mn2 eq n2 then output;
mn1 = .;
end;
else do;
if mn3 eq n3 then output;
mn1 = .;
end;
end;
drop mn:;
run;
proc print;
run;
[/pre]
DanielSantos
Barite | Level 11
My solution assumes input data sorted by ID, YEAR and involves using two arrays. One auxiliary (temporary) array which will store the correct obs, and one associated array with the N1-N4 variables. Then you just have to "travel" through the group, check every obs. and store the correct one into the auxiliary array. At the end of the group, just output the auxiliary array.
Since there is only 4 variables to preserve, this could be also done with 4 auxiliary variables. But the array solution provides a "cleaner" coding.
[pre]
data NEWTABLE;
set OLDTABLE;
by ID YEAR;
drop I;
array _NX {4} _temporary_; * auxiliary array;
array NX {4} N1 - N4; * associated array;
* init aux array;
if first.YEAR then _NX{1}=.;
* check rule and store into auxiliary array;
if (NX{1} eq 1 and (_NX{1} ne 1 or NX{2} gt _NX{2})) or
(NX(1) eq 0 and _NX{1} ne 1 and NX{3} gt _NX{3}) then
do I=1 to 4;
_NX{I}=NX{I};
end;
if last.YEAR; * last element in group;
do i=1 to 4;
NX{i}=_NX{i};
end; * restore data from auxiliary array;
run;
[/pre]
Cheers from Portugal.

Daniel Santos @ www.cgd.pt
Flip
Fluorite | Level 6
Unless I am missing something the SQL solution is simple.
The max of N1 being 0 would mean all obs are 0, it being 1 would mean at least one was 1. The only problem I see with this solution is in the case of duplicate values of N3 or N4.


proc sql;
create table new as select a.id, a.year, n1,n2,n3,n4 from
( select *,
max(n1) as m1, max(n2) as m2, max(n3) as m3
from
have
group by id, year) a
left join
(select id, year, max(n2) as o2 from (select * from have where n1 = 1) group by id, year )b
on a.id = b.id and a.year = b.year

where (a.m1 = 1 and a.n1 = a.m1 and a.n2 = b.o2) or (a.m1= 0 and a.n3=a.m3)
;
quit;
Oops missing one thing here


Message was edited by: Flip This should do it, I missed the apart about N1 = 1 for the solution.


Message was edited by: Flip
deleted_user
Not applicable
Thanks to everyone.

Apart from the earlier question: In a new table I want an individuals all records for each year, if the individual have at least one record that year where N1 equals 1.

I know how to do it, if I didn't have to regard the different years:

select * from oldtable where id in (select distinct id from oldtable where n1=1);

How do I solve the year problem?
sbb
Lapis Lazuli | Level 10 sbb
Lapis Lazuli | Level 10
Suggest top OP: create a new post with each new query - paste a backward link to prior post(s) as needed.

Given the need to create a separate file for each year, one method is to develop/code a SAS macro which generates suitable PROC SQL code with WHERE clause to filter on your time-period variable, iterating for each unique year-portion value.

Scott Barry
SBBWorks, Inc.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 859 views
  • 0 likes
  • 5 in conversation