Solved: Observation selection

soumri · Posted 11-15-2019 08:10 AM

Hi,
It's a somewhat complicated question I'm going to ask here, but many of my analyzes depend heavily on it. Your help would be, therefore, very precious for me.

In fact, I have a database with over than 700,000 observations, that seems to this:

GROUP ID RANK TD YEAR SEASON GYS Value
LH84  B000001 1     28MAR2011 2011 1 LH8420111 11.2
LH84  B000001 1     30MAY2011 2011 2 LH8420111 9.6
LH84  B000001 1     27JUN2011 2011 3 LH8420113 7.8
LH84  B000001 1     01AUG2011 2011 3 LH8420113 7.2
LH84  B000001 2     09FEB2012 2012 1      LH8420121 19.3
LH84  B000001 2     10APR2012 2012 2      LH8420122 20.6
LH84  B000002 1     10APR2012 2012 2      LH8420122 9.4
LH84  B000002 1     05JUN2012 2012 3      LH8420123 10.9
LH84 B000002 1 14AUG2012 2012 3 LH8420123 8.7
KC01 B000013 4 18JUN2000 2000 3 KC0120003 9.6
KC01 B000013 4 14AUG2000 2000 3 KC0120003 9.2
KC01 B000013 4 14OCT2000 2000 4 KC0120004 7.2
etc...

With:

- TD is a test date,

- Year is the year of TD (extracted from the TD variable)

- Season is the season of the TD (Months from the test date were used to create the season variable).

- And, GYS is a composite variable, created from GROUP, YEAR and SEASON variables

* In my raw database there are almost 1100 different GROUPS, more than 45,000 different IDs, 5 different RANK, 15 Years (from 2000 to 2014) and 4 SEASONS.

* Each ID can have up to 11 different tests for a particular RANK

How do I to create another new one (while keeping the same variables), which fulfills the following conditions?

- Only IDs with at least 3 tests for a particular RANK are considered.
- Only GROUPS that contain at least 5 different IDs in a given YEAR are considered;
- For each class of GYS there are at least 4 observations.

The sample I gave is very small and can not be used to test any code, but I could provide you with a larger sample if the need arises.
My best thanks.

DarthPathos · Posted 11-27-2019 08:54 AM

Hi @soumri

It's not elegant, but it works (as we've discussed over email). This is more step-by-step than I would normally do it but I did it this way intentionally so that you (and others) would understand what I was doing.

Looking forward to seeing your paper! Please reach out should you need anything further.
Chris

/* 
Code written and tested on SAS University Edition, running on a MacBook Pro and VirtualBox.  
SAS UE is run on Chrome.

Requirements:
1) A minimum of 4 and a maximum of 11 observations for each identifier in a given rank (already verified) AND
2) Each group (GRP) for a given YTD must contain at least 5 different IDs. AND
3) Each class of GYS must contain at least 4 observations.

Description of the code:
1 - 3: Take the requirements above and create specific temporary tables with the data that meet the criteria
4: Take the temp table that I have created in 1 and create a new table where the IDs from Table1
   match those in our original table
5: Take the new table created in 4 and match it with the original data by GRP and YTD
6: Take the new table created in 5 and match it with the original data by ID

Results:
This ends up with a dataset (n = 525) that match all three of your criteria.

Addition: Step 7 - Remerged the 525 observations back with the original dataset and create a 
flag to indicate a match.

Notes: Change "work.import" to whatever library.table_name you have.  Once you've made that change, 
  the remainder of the code should work.  I've tested this three times and get the same results.
*/

/* Import code 
%web_drop_table(WORK.IMPORT);
FILENAME REFFILE '/folders/myfolders/dataset.csv';
PROC IMPORT DATAFILE=REFFILE
	DBMS=CSV
	OUT=WORK.IMPORT;
	GETNAMES=YES;
RUN;
PROC CONTENTS DATA=WORK.IMPORT; RUN;
%web_open_table(WORK.IMPORT);
*/

proc sql; 
/* 1 */ 
create table work.table1 as 
select ID, count(*) as Cnt
from work.import
group by ID
having count(*)>=4 and count(*)<=11
order by ID;

/* 2 */
create table work.table2a as 
select distinct grp, ytd, id
from work.import;

create table work.table2 as
select grp, ytd, count(*) as Cnt 
from work.table2a
group by grp, ytd
having count(*)>=5
order by grp, ytd;

/* 3 */
create table work.table3 as 
select gys, count(*) as Cnt
from work.import
group by gys
having count(*)>=4
order by gys;

/* 4 */
create table work.table5 as 
select a.* 
from work.import a, work.table3 b 
where a.gys = b.gys;

/* 5 */
create table work.table6 as 
select a.*
from work.import a, work.table2 b
where a.grp = b.grp
and a.ytd = b.ytd;

/* 6 */
create table work.table7 as 
select a.*
from work.import a, work.table1 b
where a.id = b.id;
quit;

/* 7 */
proc sql;
create table work.final as 
select a.*, 
case when b.id is not null then '1' else '0' end as Match
from work.import a left join work.table7 b  
/* LEFT Join = ALL records from the original table and the records that match from TABLE7 */
 on a.grp = b.grp
 and a.gys = b.gys 
 and a.id = b.id;
 
 select * from work.final order by Match desc;
 quit;

Has my article or post helped? Please mark as Solution or Like the article!

View solution in original post

DarthPathos · Posted 11-15-2019 08:24 AM

hi @soumri -

Without actually testing this I think the easiest would be to run a couple of PROC SQL queries that would generate tables and then update your dataset accordingly.

- Only IDs with at least 3 tests for a particular RANK are considered.

PROC SQL;
create table id_count as 
select ID, count(test_variable) as Cnt
from table_name
where rank = "1"
group by ID
order by ID;
QUIT:

- Only GROUPS that contain at least 5 different IDs in a given YEAR are considered;

PROC SQL;
select distinct group, ID, year
into #Temp_Table
from table_name;

create table grp_cnt as
select group, year, count(*) as Cnt
from #Temp_Table
group by group, year
having count(*)>=5
order by group, year; 
QUIT;

- For each class of GYS there are at least 4 observations.

 
PROC SQL;
select distinct gys, ID
into #Temp_Table2
from table_name
group by gys, ID
order by gys, ID;

create table gys_cnt as  
select gys, count(*) as Cnt
from #Temp_Table2 
group by gys
having count(*)>=4
order by gys;
QUIT;

You should be able to then take the results from your tables and generate new variables etc. as needed (select a.* , b.* from main_table a left join gys_cnt b on a.id = b.id or something similar).

Hope this helps!

Chris

Has my article or post helped? Please mark as Solution or Like the article!

soumri · Posted 11-15-2019 09:27 AM

Thank you for paying attention to my request.

I see that the first part of the code ( just a counter, which I have already tried), should be changed by this one:

PROC SQL;
create table id_count as
select ID, count (test_variable) as Cnt
from table_name
group by ID, rank
order by ID, Rank, TD;
QUIT;

for the second and third condition, they must not be separated to accomplish them together. Treating them separately leads to false results.
moreover, for the second condition, an error is generated which I do not see where is the problem:

16558 select distinct group, ID, year into Temp_Table
-------
79
ERROR 79-322: Expecting a:.

Thanks again!

DarthPathos · Posted 11-15-2019 09:34 AM

Hm - OK, I don't have access to SAS right now but I will try to take a closer look when I'm home later today. In the meantime, take a look at the following links as they may help:

https://communities.sas.com/t5/General-SAS-Programming/Creating-a-conditional-Variable/td-p/295677

https://stats.idre.ucla.edu/sas/modules/creating-and-recoding-variables-in-sas/

https://communities.sas.com/t5/SAS-Programming/Creating-a-conditional-count-variable/td-p/181598

Good luck and post back if you figure anything out, or if you have additional questions 🙂

Chris

Has my article or post helped? Please mark as Solution or Like the article!

soumri · Posted 11-15-2019 09:44 AM

thank you DarthPathos,

I could perhaps send you my dataset with the real variables, if tests are needed.
A colleague told me (because it sounds like a loop maybe) that I have to use ARRAYs and DO Loops !!!! I do not know!
I will remain at your disposal for more details.

DarthPathos · Posted 11-15-2019 09:49 AM

Arrays and DO Loops are extremely powerful...and really confusing if you've never used them. I'll see what I can do tonight / this weekend and will post back, maybe a subset of the live data will help but I'll let you know.

Chris

Has my article or post helped? Please mark as Solution or Like the article!

DarthPathos · Posted 11-18-2019 07:50 AM

Good morning @soumri

I've done some tests on some dummy data I created based on your original post. I think I have something figured out but if you can post your dataset, that would be fantastic - I don't want to post anything here that won't help you. Once I've had a chance to run through your dataset, I'll be able to confirm my code and will then post back here.

I'll need a day or two after you send me the data; hopefully that's OK.

Chris

Has my article or post helped? Please mark as Solution or Like the article!

soumri · Posted 11-18-2019 08:06 AM

Hi DarthPathos,
Nice to hear from you. Is it possible to send you my dataset privately and discuss exactly what I need with variables exactly as they appear, then we'll publish the solution for the whole SAS community with just the example I've sent.

DarthPathos · Posted 11-18-2019 08:10 AM

If you can enable Private Messages in your SAS Communities profile, I will message you with my email address.

Chris

Has my article or post helped? Please mark as Solution or Like the article!

DarthPathos · Posted 11-27-2019 08:54 AM

Hi @soumri

It's not elegant, but it works (as we've discussed over email). This is more step-by-step than I would normally do it but I did it this way intentionally so that you (and others) would understand what I was doing.

Looking forward to seeing your paper! Please reach out should you need anything further.
Chris

/* 
Code written and tested on SAS University Edition, running on a MacBook Pro and VirtualBox.  
SAS UE is run on Chrome.

Requirements:
1) A minimum of 4 and a maximum of 11 observations for each identifier in a given rank (already verified) AND
2) Each group (GRP) for a given YTD must contain at least 5 different IDs. AND
3) Each class of GYS must contain at least 4 observations.

Description of the code:
1 - 3: Take the requirements above and create specific temporary tables with the data that meet the criteria
4: Take the temp table that I have created in 1 and create a new table where the IDs from Table1
   match those in our original table
5: Take the new table created in 4 and match it with the original data by GRP and YTD
6: Take the new table created in 5 and match it with the original data by ID

Results:
This ends up with a dataset (n = 525) that match all three of your criteria.

Addition: Step 7 - Remerged the 525 observations back with the original dataset and create a 
flag to indicate a match.

Notes: Change "work.import" to whatever library.table_name you have.  Once you've made that change, 
  the remainder of the code should work.  I've tested this three times and get the same results.
*/

/* Import code 
%web_drop_table(WORK.IMPORT);
FILENAME REFFILE '/folders/myfolders/dataset.csv';
PROC IMPORT DATAFILE=REFFILE
	DBMS=CSV
	OUT=WORK.IMPORT;
	GETNAMES=YES;
RUN;
PROC CONTENTS DATA=WORK.IMPORT; RUN;
%web_open_table(WORK.IMPORT);
*/

proc sql; 
/* 1 */ 
create table work.table1 as 
select ID, count(*) as Cnt
from work.import
group by ID
having count(*)>=4 and count(*)<=11
order by ID;

/* 2 */
create table work.table2a as 
select distinct grp, ytd, id
from work.import;

create table work.table2 as
select grp, ytd, count(*) as Cnt 
from work.table2a
group by grp, ytd
having count(*)>=5
order by grp, ytd;

/* 3 */
create table work.table3 as 
select gys, count(*) as Cnt
from work.import
group by gys
having count(*)>=4
order by gys;

/* 4 */
create table work.table5 as 
select a.* 
from work.import a, work.table3 b 
where a.gys = b.gys;

/* 5 */
create table work.table6 as 
select a.*
from work.import a, work.table2 b
where a.grp = b.grp
and a.ytd = b.ytd;

/* 6 */
create table work.table7 as 
select a.*
from work.import a, work.table1 b
where a.id = b.id;
quit;

/* 7 */
proc sql;
create table work.final as 
select a.*, 
case when b.id is not null then '1' else '0' end as Match
from work.import a left join work.table7 b  
/* LEFT Join = ALL records from the original table and the records that match from TABLE7 */
 on a.grp = b.grp
 and a.gys = b.gys 
 and a.id = b.id;
 
 select * from work.final order by Match desc;
 quit;

Has my article or post helped? Please mark as Solution or Like the article!

soumri · Posted 11-28-2019 05:50 AM

@DarthPathos

Yes Chris, it works very well. Thank you for this valuable collaboration.

Observation selection

Re: Observation selection

Re: Observation selection

Re: Observation selection

Re: Observation selection

Re: Observation selection

Re: Observation selection

Re: Observation selection

Re: Observation selection

Re: Observation selection

Re: Observation selection

Re: Observation selection