Re: How do I flag across different tables with duplicates for numeric ...

novicenoice · Posted 06-07-2024 11:36 AM

I have 1 reference table and 2 different tables. The reference table holds the ID key and unique records.

The 2 different tables hold the same ID key column (for merging later), and other necessary columns. However, there are duplicate records with differing data per ID key per column. Examples below

Eventually, with these flags, I'd then merge all 3 tables and determine if a record has valid data for each column of interest and by hospital.

Reference table

ID key	Minutes	Hospital
1	4	A
2	1	B
3	8	A
4	6	C
5	9	F

Table 2

ID key	Score
1
1
1	3
2	2
3
4	1
4
5
5

Table 3

ID key	Score 2
1	Positive
2	N/A
2	Positive
2	N/A
3	Negative
3
3	Negative
4
4
5	Non-conclusive
5	Non-conclusive

In Tables 2 and 3, I want to create flags for

If the data element is numeric (Table 2), I want to create a flag (1, 0) where Score 1s are anything but missing.
If the data element is categorical (Table 3), I want to create a flag (1, 0) where Score 2s were recorded (ie "Negative", "Positive", "Non-conclusive"). It doesn't matter to me what the score is, just that there was a score

Table I want:

Want

ID key	Minutes	Hospital	Score	Score 2
1	4	A	1	1
2	1	B	1	1
3	8	A	0	1
4	6	C	1	0
5	9	F	0	1

Thoughts on how to do this? I've merged all 3 tables first but that creates other issues (eg if table 2 has 2 duplicate records and table 3 has 4 duplicate records, I'm creating 8 records). Each table actually has several numeric and categorical columns of interest.

For an additional challenge, some outcomes depend on two different columns across two different tables (eg if Score 1 is valid and if Score 2 is valid)...

Tips and thoughts appreciated, thank you!

ballardw · Posted 06-07-2024 11:49 AM

My first thought is that you have not defined a clear set of rules of how table 2 and table 3 (and any other table) should be combined. Your shown Idkey seems not to sufficiently identify things that should join from your limited description. Are there perhaps other values floating around such as date or particular service involved that would make combining things more of a 1 to 1 match (or 1 to none) between table 2 and 3.

Or are you not quite describing the issue such as you want "if ANY of the Idkey value has a 'score' "? Which would mean summarize the data to get that before combining the results.

Note: Reference table information at this stage isn't adding to the complexity and could be ignored as there are many ways to 1-to-1 lookup later based on the Idkey values as needed.

novicenoice · Posted 06-07-2024 11:59 AM

For this particular dataset, the reference table contains incident-specific information for each ID key. So the ID key is unique and the matching ID key in all other tables are created from that ID key. The reason for duplicate records is if Score1 or Score2, for example, have multiple entries depending on who and when the record was documented.

In my mind, in creating flags, my final table will be Table 1 with columns containing the flags. Maybe my approach isn't the most appropriate approach...?

ballardw · Posted 06-07-2024 02:15 PM

@novicenoice wrote:

For this particular dataset, the reference table contains incident-specific information for each ID key. So the ID key is unique and the matching ID key in all other tables are created from that ID key. The reason for duplicate records is if Score1 or Score2, for example, have multiple entries depending on who and when the record was documented.

In my mind, in creating flags, my final table will be Table 1 with columns containing the flags. Maybe my approach isn't the most appropriate approach...?

I am not sure quite where you are going with this but it sounds like your tables with the Scores should include one or more additional variables that indicate why there are duplicate Id Key values. That may be a "who" or a "when" or maybe both if the same "who" is associated with entering or assigning the scores. The additional identification variables should allow matching up the related score1, score2 (etc) that should be together for anything requiring both, or multiple, scores.

Sometimes the data is just ugly and requires more work than you expected as you dig into it.

Get the rules for how the "multiple" scores are to be combined for use though. As I said earlier, once the multiple cases are organized then add the reference data will be easy.

Quentin · Posted 06-07-2024 02:26 PM

This is hard to follow. But I would suggest starting by selecting the records you want from table 2 (non-missing values for score) and de-duplicating it. Then would select the records you want from table 3 ( Score2 IN("Negative", "Positive", "Non-conclusive") ) and de-duplicating it.

Then you would have three files each with a unique value, and you can merge them together without worrying about duplicates. If you want the flags, you can make them at that point.

Sarath_A_SAS · Posted 09-04-2024 10:24 PM

/* Importing the Reference Table */
data reference_table;
    input ID_key Minutes Hospital $;
    datalines;
1 4 A
2 1 B
3 8 A
4 6 C
5 9 F
;
run;

/* Importing the Reference Table */
data table2;
    input ID_key score$;
    datalines;
ID key	Score
1	 
1	 
1	3
2	2
3	 
4	1
4	 
5	 
5	
;
run;

data table3;
    input ID_key score2$;
    datalines;

ID key	Score 2
1	Positive
2	N/A
2	Positive
2	N/A
3	Negative
3	 
3	Negative
4	 
4	 
5	Non-conclusive
5	Non-conclusive
run;


/* Step 1: Flagging numeric variable in Table 2 */
data table2_flagged;
    set table2;
    if not missing(Score) then Score_Flag = 1;
    else Score_Flag = 0;
run;

/* Step 2: Flagging categorical variable in Table 3 */
data table3_flagged;
    set table3;
    if not missing(Score2) then Score2_Flag = 1;
    else Score2_Flag = 0;
run;

/* Step 3: Aggregating flags for Table 2 */
proc sql;
    create table table2_agg as
    select ID_key, max(Score_Flag) as Score
    from table2_flagged
    group by ID_key;
quit;

/* Step 4: Aggregating flags for Table 3 */
proc sql;
    create table table3_agg as
    select ID_key, max(Score2_Flag) as Score2
    from table3_flagged
    group by ID_key;
quit;

/* Step 5: Merging all tables */
proc sql;
    create table final_table as
    select ref.ID_key, ref.Minutes, ref.Hospital,
           coalesce(t2.Score, 0) as Score, /* Default to 0 if missing */
           coalesce(t3.Score2, 0) as Score2 /* Default to 0 if missing */
    from reference_table as ref
    left join table2_agg as t2 on ref.ID_key = t2.ID_key
    left join table3_agg as t3 on ref.ID_key = t3.ID_key;
quit;

/* Step 6: Adding dependent flags (example) */
data final_table;
    set final_table;
    if Score = 1 and Score2 = 1 then Valid_Flag = 1;
    else Valid_Flag = 0;
run;

To tackle this problem, you can approach it step by step by:

Creating flags in individual tables for numeric and categorical variables.
Merging the tables carefully to ensure the duplication issue doesn't inflate the number of records.
Combining the results to achieve your desired "want" table with the valid records marked by flags.

How do I flag across different tables with duplicates for numeric and categorial variables

Re: How do I flag across different tables with duplicates for numeric and categorial variables

Re: How do I flag across different tables with duplicates for numeric and categorial variables

Re: How do I flag across different tables with duplicates for numeric and categorial variables

Re: How do I flag across different tables with duplicates for numeric and categorial variables

Re: How do I flag across different tables with duplicates for numeric and categorial variables

How do I flag across different tables with duplicates for numeric and categorial variables

Re: How do I flag across different tables with duplicates for numeric and categorial variables

Re: How do I flag across different tables with duplicates for numeric and categorial variables

Re: How do I flag across different tables with duplicates for numeric and categorial variables

Re: How do I flag across different tables with duplicates for numeric and categorial variables

Re: How do I flag across different tables with duplicates for numeric and categorial variables

Registration is open

SAS Training: Just a Click Away