BookmarkSubscribeRSS Feed
novicenoice
Calcite | Level 5

I have 1 reference table and 2 different tables. The reference table holds the ID key and unique records.

 

The 2 different tables hold the same ID key column (for merging later), and other necessary columns. However, there are duplicate records with differing data per ID key per column. Examples below

 

Eventually, with these flags, I'd then merge all 3 tables and determine if a record has valid data for each column of interest and by hospital.

 

Reference table

ID keyMinutesHospital
14A
21B
38A
46C
59F

 

Table 2

ID keyScore
1 
1 
13
22
3 
41
4 
5 
5 

 

Table 3

ID keyScore 2
1Positive
2N/A
2Positive
2N/A
3Negative
3 
3Negative
4 
4 
5Non-conclusive
5Non-conclusive

 

In Tables 2 and 3, I want to create flags for

  • If the data element is numeric (Table 2), I want to create a flag (1, 0) where Score 1s are anything but missing.
  • If the data element is categorical (Table 3), I want to create a flag (1, 0) where Score 2s were recorded (ie "Negative", "Positive", "Non-conclusive"). It doesn't matter to me what the score is, just that there was a score

Table I want:

Want

ID keyMinutesHospitalScoreScore 2
14A11
21B11
38A01
46C10
59F01

 

Thoughts on how to do this? I've merged all 3 tables first but that creates other issues (eg if table 2 has 2 duplicate records and table 3 has 4 duplicate records, I'm creating 8 records). Each table actually has several numeric and categorical columns of interest.

 

For an additional challenge, some outcomes depend on two different columns across two different tables (eg if Score 1 is valid and if Score 2 is valid)...

 

Tips and thoughts appreciated, thank you!

 

5 REPLIES 5
ballardw
Super User

My first thought is that you have not defined a clear set of rules of how table 2 and table 3 (and any other table) should be combined. Your shown Idkey seems not to sufficiently identify things that should join from your limited description. Are there perhaps other values floating around such as date or particular service involved that would make combining things more of a 1 to 1 match (or 1 to none) between table 2 and 3.

 

Or are you not quite describing the issue such as you want "if ANY of the Idkey value has a 'score' "? Which would mean summarize the data to get that before combining the results.

 

Note: Reference table information at this stage isn't adding to the complexity and could be ignored as there are many ways to 1-to-1 lookup later based on the Idkey values as needed.

novicenoice
Calcite | Level 5

For this particular dataset, the reference table contains incident-specific information for each ID key. So the ID key is unique and the matching ID key in all other tables are created from that ID key. The reason for duplicate records is if Score1 or Score2, for example, have multiple entries depending on who and when the record was documented.

 

In my mind, in creating flags, my final table will be Table 1 with columns containing the flags. Maybe my approach isn't the most appropriate approach...?

ballardw
Super User

@novicenoice wrote:

For this particular dataset, the reference table contains incident-specific information for each ID key. So the ID key is unique and the matching ID key in all other tables are created from that ID key. The reason for duplicate records is if Score1 or Score2, for example, have multiple entries depending on who and when the record was documented.

 

In my mind, in creating flags, my final table will be Table 1 with columns containing the flags. Maybe my approach isn't the most appropriate approach...?


I am not sure quite where you are going with this but it sounds like your tables with the Scores should include one or more additional variables that indicate why there are duplicate Id Key values. That may be a "who" or a "when" or maybe both if the same "who"  is associated with entering or assigning the scores.  The additional identification variables should allow matching up the related score1, score2 (etc) that should be together for anything requiring both, or multiple, scores.

 

Sometimes the data is just ugly and requires more work than you expected as you dig into it.

Get the rules for how the "multiple" scores are to be combined for use though. As I said earlier, once the multiple cases are organized then add the reference data will be easy.

 

 

Quentin
Super User

This is hard to follow.  But I would suggest starting by selecting the records you want from table 2 (non-missing values for score) and de-duplicating it.  Then would select the records you want from table 3 ( Score2 IN("Negative", "Positive", "Non-conclusive") ) and de-duplicating it.  

 

Then you would have three files each with a unique value, and you can merge them together without worrying about duplicates.  If you want the flags, you can make them at that point.

Sarath_A_SAS
Obsidian | Level 7

 

/* Importing the Reference Table */
data reference_table;
    input ID_key Minutes Hospital $;
    datalines;
1 4 A
2 1 B
3 8 A
4 6 C
5 9 F
;
run;

/* Importing the Reference Table */
data table2;
    input ID_key score$;
    datalines;
ID key	Score
1	 
1	 
1	3
2	2
3	 
4	1
4	 
5	 
5	
;
run;

data table3;
    input ID_key score2$;
    datalines;

ID key	Score 2
1	Positive
2	N/A
2	Positive
2	N/A
3	Negative
3	 
3	Negative
4	 
4	 
5	Non-conclusive
5	Non-conclusive
run;


/* Step 1: Flagging numeric variable in Table 2 */
data table2_flagged;
    set table2;
    if not missing(Score) then Score_Flag = 1;
    else Score_Flag = 0;
run;

/* Step 2: Flagging categorical variable in Table 3 */
data table3_flagged;
    set table3;
    if not missing(Score2) then Score2_Flag = 1;
    else Score2_Flag = 0;
run;

/* Step 3: Aggregating flags for Table 2 */
proc sql;
    create table table2_agg as
    select ID_key, max(Score_Flag) as Score
    from table2_flagged
    group by ID_key;
quit;

/* Step 4: Aggregating flags for Table 3 */
proc sql;
    create table table3_agg as
    select ID_key, max(Score2_Flag) as Score2
    from table3_flagged
    group by ID_key;
quit;

/* Step 5: Merging all tables */
proc sql;
    create table final_table as
    select ref.ID_key, ref.Minutes, ref.Hospital,
           coalesce(t2.Score, 0) as Score, /* Default to 0 if missing */
           coalesce(t3.Score2, 0) as Score2 /* Default to 0 if missing */
    from reference_table as ref
    left join table2_agg as t2 on ref.ID_key = t2.ID_key
    left join table3_agg as t3 on ref.ID_key = t3.ID_key;
quit;

/* Step 6: Adding dependent flags (example) */
data final_table;
    set final_table;
    if Score = 1 and Score2 = 1 then Valid_Flag = 1;
    else Valid_Flag = 0;
run;

 

To tackle this problem, you can approach it step by step by:

  1. Creating flags in individual tables for numeric and categorical variables.
  2. Merging the tables carefully to ensure the duplication issue doesn't inflate the number of records.
  3. Combining the results to achieve your desired "want" table with the valid records marked by flags.

 

 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 5 replies
  • 1935 views
  • 0 likes
  • 4 in conversation