BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
soumri
Quartz | Level 8

Hi,
It's a somewhat complicated question I'm going to ask here, but many of my analyzes depend heavily on it. Your help would be, therefore, very precious for me.

In fact, I have a database with over than 700,000 observations, that seems to this:

GROUP ID      RANK  TD        YEAR SEASON GYS       Value
LH84  B000001 1     28MAR2011 2011 1      LH8420111 11.2
LH84  B000001 1     30MAY2011 2011 2      LH8420111  9.6
LH84  B000001 1     27JUN2011 2011 3      LH8420113  7.8
LH84  B000001 1     01AUG2011 2011 3      LH8420113  7.2
LH84  B000001 2     09FEB2012 2012 1      LH8420121 19.3
LH84  B000001 2     10APR2012 2012 2      LH8420122 20.6
LH84  B000002 1     10APR2012 2012 2      LH8420122  9.4
LH84  B000002 1     05JUN2012 2012 3      LH8420123 10.9
LH84  B000002 1     14AUG2012 2012 3      LH8420123  8.7
KC01  B000013 4     18JUN2000 2000 3      KC0120003  9.6
KC01  B000013 4     14AUG2000 2000 3      KC0120003  9.2
KC01  B000013 4     14OCT2000 2000 4      KC0120004  7.2
etc...

With:

- TD is a test date,

- Year is the year of TD (extracted from the TD variable)

- Season is the season of the TD (Months from the test date were used to create the season variable).

- And, GYS is a composite variable, created from GROUP, YEAR and SEASON variables 

 

* In my raw database there are almost 1100 different GROUPS, more than 45,000 different IDs, 5 different RANK, 15 Years (from 2000 to 2014) and 4 SEASONS.

* Each ID can have up to 11 different tests for a particular RANK 

 

How do I to create another new one (while keeping the same variables), which fulfills the following conditions?

- Only IDs with at least 3 tests for a particular RANK are considered.
- Only GROUPS that contain at least 5 different IDs in a given YEAR are considered;
- For each class of GYS there are at least 4 observations.

The sample I gave is very small and can not be used to test any code, but I could provide you with a larger sample if the need arises.
My best thanks.

1 ACCEPTED SOLUTION

Accepted Solutions
DarthPathos
Lapis Lazuli | Level 10

Hi @soumri 

It's not elegant, but it works (as we've discussed over email).  This is more step-by-step than I would normally do it but I did it this way intentionally so that you (and others) would understand what I was doing. 

Looking forward to seeing your paper!  Please reach out should you need anything further.
Chris

 

/* 
Code written and tested on SAS University Edition, running on a MacBook Pro and VirtualBox.  
SAS UE is run on Chrome.

Requirements: 1) A minimum of 4 and a maximum of 11 observations for each identifier in a given rank (already verified) AND 2) Each group (GRP) for a given YTD must contain at least 5 different IDs. AND 3) Each class of GYS must contain at least 4 observations. Description of the code: 1 - 3: Take the requirements above and create specific temporary tables with the data that meet the criteria 4: Take the temp table that I have created in 1 and create a new table where the IDs from Table1 match those in our original table 5: Take the new table created in 4 and match it with the original data by GRP and YTD 6: Take the new table created in 5 and match it with the original data by ID Results: This ends up with a dataset (n = 525) that match all three of your criteria.

Addition: Step 7 - Remerged the 525 observations back with the original dataset and create a
flag to indicate a match. Notes: Change "work.import" to whatever library.table_name you have. Once you've made that change, the remainder of the code should work. I've tested this three times and get the same results. */ /* Import code %web_drop_table(WORK.IMPORT); FILENAME REFFILE '/folders/myfolders/dataset.csv'; PROC IMPORT DATAFILE=REFFILE DBMS=CSV OUT=WORK.IMPORT; GETNAMES=YES; RUN; PROC CONTENTS DATA=WORK.IMPORT; RUN; %web_open_table(WORK.IMPORT); */ proc sql; /* 1 */ create table work.table1 as select ID, count(*) as Cnt from work.import group by ID having count(*)>=4 and count(*)<=11 order by ID; /* 2 */ create table work.table2a as select distinct grp, ytd, id from work.import; create table work.table2 as select grp, ytd, count(*) as Cnt from work.table2a group by grp, ytd having count(*)>=5 order by grp, ytd; /* 3 */ create table work.table3 as select gys, count(*) as Cnt from work.import group by gys having count(*)>=4 order by gys; /* 4 */ create table work.table5 as select a.* from work.import a, work.table3 b where a.gys = b.gys; /* 5 */ create table work.table6 as select a.* from work.import a, work.table2 b where a.grp = b.grp and a.ytd = b.ytd; /* 6 */ create table work.table7 as select a.* from work.import a, work.table1 b where a.id = b.id; quit; /* 7 */ proc sql; create table work.final as select a.*, case when b.id is not null then '1' else '0' end as Match from work.import a left join work.table7 b /* LEFT Join = ALL records from the original table and the records that match from TABLE7 */ on a.grp = b.grp and a.gys = b.gys and a.id = b.id; select * from work.final order by Match desc; quit;

 

Has my article or post helped? Please mark as Solution or Like the article!

View solution in original post

10 REPLIES 10
DarthPathos
Lapis Lazuli | Level 10

hi @soumri

 

Without actually testing this I think the easiest would be to run a couple of PROC SQL queries that would generate tables and then update your dataset accordingly.

 

- Only IDs with at least 3 tests for a particular RANK are considered.

PROC SQL;
create table id_count as select ID, count(test_variable) as Cnt from table_name where rank = "1" group by ID order by ID; QUIT:

- Only GROUPS that contain at least 5 different IDs in a given YEAR are considered;

PROC SQL;
select distinct group, ID, year
into #Temp_Table
from table_name;

create table grp_cnt as select group, year, count(*) as Cnt from #Temp_Table group by group, year having count(*)>=5 order by group, year; QUIT;

- For each class of GYS there are at least 4 observations.

 
PROC SQL;
select distinct gys, ID
into #Temp_Table2
from table_name
group by gys, ID
order by gys, ID;

create table gys_cnt as select gys, count(*) as Cnt from #Temp_Table2 group by gys having count(*)>=4 order by gys; QUIT;

You should be able to then take the results from your tables and generate new variables etc. as needed (select a.* , b.* from main_table a left join gys_cnt b on a.id = b.id or something similar).

Hope this helps!

Chris

Has my article or post helped? Please mark as Solution or Like the article!
soumri
Quartz | Level 8

Thank you for paying attention to my request.

I see that the first part of the code ( just a counter, which I have already tried), should be changed by this one:

PROC SQL;
create table id_count as
select ID, count (test_variable) as Cnt
from table_name
group by ID, rank
order by ID, Rank, TD;
QUIT;


for the second and third condition, they must not be separated to accomplish them together. Treating them separately leads to false results.
moreover, for the second condition, an error is generated which I do not see where is the problem:

16558 select distinct group, ID, year into Temp_Table
                                                                  -------
                                                                  79
ERROR 79-322: Expecting a:.

 

Thanks again!

DarthPathos
Lapis Lazuli | Level 10

Hm - OK, I don't have access to SAS right now but I will try to take a closer look when I'm home later today.  In the meantime, take a look at the following links as they may help:

https://communities.sas.com/t5/General-SAS-Programming/Creating-a-conditional-Variable/td-p/295677

https://stats.idre.ucla.edu/sas/modules/creating-and-recoding-variables-in-sas/

https://communities.sas.com/t5/SAS-Programming/Creating-a-conditional-count-variable/td-p/181598

 

Good luck and post back if you figure anything out, or if you have additional questions 🙂

Chris

Has my article or post helped? Please mark as Solution or Like the article!
soumri
Quartz | Level 8

thank you 

 

DarthPathos
Lapis Lazuli | Level 10

Arrays and DO Loops are extremely powerful...and really confusing if you've never used them.  I'll see what I can do tonight / this weekend and will post back, maybe a subset of the live data will help but I'll let you know.  

Chris

Has my article or post helped? Please mark as Solution or Like the article!
DarthPathos
Lapis Lazuli | Level 10

Good morning @soumri 

I've done some tests on some dummy data I created based on your original post.  I think I have something figured out but if you can post your dataset, that would be fantastic - I don't want to post anything here that won't help you.  Once I've had a chance to run through your dataset, I'll be able to confirm my code and will then post back here. 

 

I'll need a day or two after you send me the data; hopefully that's OK. 

 

Chris

Has my article or post helped? Please mark as Solution or Like the article!
soumri
Quartz | Level 8

Hi ,
Nice to hear from you. Is it possible to send you my dataset privately and discuss exactly what I need with variables exactly as they appear, then we'll publish the solution for the whole SAS community with just the example I've sent.

DarthPathos
Lapis Lazuli | Level 10

If you can enable Private Messages in your SAS Communities profile, I will message you with my email address.

Chris

Has my article or post helped? Please mark as Solution or Like the article!
DarthPathos
Lapis Lazuli | Level 10

Hi @soumri 

It's not elegant, but it works (as we've discussed over email).  This is more step-by-step than I would normally do it but I did it this way intentionally so that you (and others) would understand what I was doing. 

Looking forward to seeing your paper!  Please reach out should you need anything further.
Chris

 

/* 
Code written and tested on SAS University Edition, running on a MacBook Pro and VirtualBox.  
SAS UE is run on Chrome.

Requirements: 1) A minimum of 4 and a maximum of 11 observations for each identifier in a given rank (already verified) AND 2) Each group (GRP) for a given YTD must contain at least 5 different IDs. AND 3) Each class of GYS must contain at least 4 observations. Description of the code: 1 - 3: Take the requirements above and create specific temporary tables with the data that meet the criteria 4: Take the temp table that I have created in 1 and create a new table where the IDs from Table1 match those in our original table 5: Take the new table created in 4 and match it with the original data by GRP and YTD 6: Take the new table created in 5 and match it with the original data by ID Results: This ends up with a dataset (n = 525) that match all three of your criteria.

Addition: Step 7 - Remerged the 525 observations back with the original dataset and create a
flag to indicate a match. Notes: Change "work.import" to whatever library.table_name you have. Once you've made that change, the remainder of the code should work. I've tested this three times and get the same results. */ /* Import code %web_drop_table(WORK.IMPORT); FILENAME REFFILE '/folders/myfolders/dataset.csv'; PROC IMPORT DATAFILE=REFFILE DBMS=CSV OUT=WORK.IMPORT; GETNAMES=YES; RUN; PROC CONTENTS DATA=WORK.IMPORT; RUN; %web_open_table(WORK.IMPORT); */ proc sql; /* 1 */ create table work.table1 as select ID, count(*) as Cnt from work.import group by ID having count(*)>=4 and count(*)<=11 order by ID; /* 2 */ create table work.table2a as select distinct grp, ytd, id from work.import; create table work.table2 as select grp, ytd, count(*) as Cnt from work.table2a group by grp, ytd having count(*)>=5 order by grp, ytd; /* 3 */ create table work.table3 as select gys, count(*) as Cnt from work.import group by gys having count(*)>=4 order by gys; /* 4 */ create table work.table5 as select a.* from work.import a, work.table3 b where a.gys = b.gys; /* 5 */ create table work.table6 as select a.* from work.import a, work.table2 b where a.grp = b.grp and a.ytd = b.ytd; /* 6 */ create table work.table7 as select a.* from work.import a, work.table1 b where a.id = b.id; quit; /* 7 */ proc sql; create table work.final as select a.*, case when b.id is not null then '1' else '0' end as Match from work.import a left join work.table7 b /* LEFT Join = ALL records from the original table and the records that match from TABLE7 */ on a.grp = b.grp and a.gys = b.gys and a.id = b.id; select * from work.final order by Match desc; quit;

 

Has my article or post helped? Please mark as Solution or Like the article!
soumri
Quartz | Level 8

@DarthPathos 

Yes Chris, it works very well. Thank you for this valuable collaboration.

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 10 replies
  • 2793 views
  • 3 likes
  • 2 in conversation