DATA Step, Macro, Functions and more

Miscoding mystery

Accepted Solution Solved
Reply
New Contributor
Posts: 4
Accepted Solution

Miscoding mystery

Greetings.

I have dissected and re-written this code in with checks for hours, and am completely stuck as to why the errors are occurring. This code involves taking combinations of cancer histology and primary site codes and mapping them to specific cancer diagnosis types:

data lymphoma2;

     merge lymphoma1 (in=in1 keep=sitecd histocd)

          map (keep=histocd diagcd);    

     by histocd;

     if in1;

     if 700<=sitecd<=729 then diagcd=38;

run;

in the above code, the merge takes the diagnosis codes (diagcd) from dataset "map" and maps them to the histology codes (histcd) in the cancer records from the dataset "lymphoma1". The if statement then takes records with site codes (sitecd) between 700 and 729 and hard codes them specifically to diagcd=38, instead of using the diagcd in the "map" dataset. However, after this data step, there are records where diagcd=38 but not the correct sitecd combination (ie, records with sitecd=770 were also mapped to diagcd=38). What is strange is if I had created a new variable, I wouldn't have the same problem:

data lymphoma2;

     merge lymphoma1 (in=in1 keep=sitecd histocd)

          map (keep=histocd diagcd);    

     by histocd;

     if in1;

     newcd=diagcd;

     if 700<=sitecd<=729 then newcd=38;

run;

I would really appreciate any insight as to why this error occurs. Thank you in advance for your help!

Attachment
Attachment

Accepted Solutions
Solution
‎02-17-2015 04:03 PM
PROC Star
Posts: 1,324

Re: Miscoding mystery

Hi,

Would I be correct to guess that this is a one-to-many merge?  That is, map is unique by histocd but lymphoma1 is not?

When you merge datasets using a BY statement, variables read from a data set are implicitly retained.  The PDV is initialized when the BY group changes.

In a one to many merge, if you reassign a value in the "one" dataset, that value will be retained until the by-group changes.  When I took a SAS training course, they had us work through the merge process and program data vector with paper, to understand this.

Here's an example of your problem:

data many;
input id test score;
cards;
1 1 10
1 2 15
1 3 20
2 1 10
2 2 15
2 3 20
;

data one;
input id gender $1;
cards;
1 M
2 F
;
run;

71   data want;
72     merge many (keep=id test score)
73           one  (keep=id gender)
74     ;
75     by id;
76     if test=2 then gender='X';
77     put (_all_)(=);
78   run;

id=1 test=1 score=10 gender=1
id=1 test=2 score=15 gender=X
id=1 test=3 score=20 gender=X
id=2 test=1 score=10 gender=2
id=2 test=2 score=15 gender=X
id=2 test=3 score=20 gender=X
NOTE: There were 6 observations read from the data set WORK.MANY.
NOTE: There were 2 observations read from the data set WORK.ONE.
NOTE: The data set WORK.WANT has 6 observations and 4 variables.

When you create a new variable, that variable is NOT retained, so everything works as you want.

A "rule" might be to avoid changing the values of any variables when you merge data in a one-to-many merge.  Another sign that you may be headed for trouble is when something just feels odd.  In my above example, why would I recode a patient-attribute (gender), based on a test-level (test) attribute?  In your example, why recode diagnostic code based on site id?

HTH,

--Q.

View solution in original post


All Replies
Super User
Super User
Posts: 7,076

Re: Miscoding mystery

I don't understand the question. Here is sample data that I think represents your "problem".

What value do you want for DIAGCD for the first observation if not the value from MAP?

data lymphoma1 ;

input histocd sitecd ;

cards;

1 700

2 770

;;;;

data map ;

  input histocd diagcd ;

cards;

1 0

2 38

;;;;

data lymphoma2;

     merge lymphoma1 (in=in1 keep=sitecd histocd)

          map (keep=histocd diagcd);  

     by histocd;

     if in1;

     if 700<=sitecd<=729 then diagcd=38;

put (_all_) (=);

run;


histocd=1 sitecd=700 diagcd=38

histocd=2 sitecd=770 diagcd=38

New Contributor
Posts: 4

Re: Miscoding mystery

Dear Tom,

Thank you for responding to my plea for help. Let's change up the example:

data lymphoma1 ;

input histocd sitecd ;

cards;

1 700

2 770

;;;;

data map ;

  input histocd diagcd ;

cards;

1 10

2 11

;;;;


Then based on my code, I would expect the final result:

histocd=1 sitecd=700 diagcd=38 (this is the result of the hard code since sitecd=700)

histocd=2 sitecd=770 diagcd=11 (this comes straight from the map)

but when I run the code with the datasets I attached earlier, diagcd for records with sitecd outside the 700-729 range get changed to 38.

Please let me know if I can clarify further.

Super User
Super User
Posts: 7,076

Re: Miscoding mystery

If you reset the value of the IN= variable at the top of the data step loop you can see when new records are actually read from the dataset.

This will explain why value for DIAGCD are carried forward onto other observations for the same HISTOCD group.  A new record is NOT read from the MAP dataset to get back to the old value of DIAGCD.

data lymphoma ;

input histocd sitecd ;

cards;

1 600

1 700

1 770

;;;;

data map ;

  input histocd diagcd ;

cards;

1 11

;;;;

libname a 'c:\downloads';

data lymphoma2;

in1=0;

in2=0;

     merge lymphoma (in=in1 keep=sitecd histocd)

          map (in=in2 keep=histocd diagcd);

     by histocd;

if first.histocd then n=0;

n+1;

put (in1 in2 n histocd sitecd diagcd ) (=) '-> ' @;

if 700<=sitecd<=729 then diagcd=38;

put diagcd= ;

run;

in1=1 in2=1 n=1 histocd=1 sitecd=600 diagcd=11 -> diagcd=11

in1=1 in2=0 n=2 histocd=1 sitecd=700 diagcd=11 -> diagcd=38

in1=1 in2=0 n=3 histocd=1 sitecd=770 diagcd=38 -> diagcd=38

Solution
‎02-17-2015 04:03 PM
PROC Star
Posts: 1,324

Re: Miscoding mystery

Hi,

Would I be correct to guess that this is a one-to-many merge?  That is, map is unique by histocd but lymphoma1 is not?

When you merge datasets using a BY statement, variables read from a data set are implicitly retained.  The PDV is initialized when the BY group changes.

In a one to many merge, if you reassign a value in the "one" dataset, that value will be retained until the by-group changes.  When I took a SAS training course, they had us work through the merge process and program data vector with paper, to understand this.

Here's an example of your problem:

data many;
input id test score;
cards;
1 1 10
1 2 15
1 3 20
2 1 10
2 2 15
2 3 20
;

data one;
input id gender $1;
cards;
1 M
2 F
;
run;

71   data want;
72     merge many (keep=id test score)
73           one  (keep=id gender)
74     ;
75     by id;
76     if test=2 then gender='X';
77     put (_all_)(=);
78   run;

id=1 test=1 score=10 gender=1
id=1 test=2 score=15 gender=X
id=1 test=3 score=20 gender=X
id=2 test=1 score=10 gender=2
id=2 test=2 score=15 gender=X
id=2 test=3 score=20 gender=X
NOTE: There were 6 observations read from the data set WORK.MANY.
NOTE: There were 2 observations read from the data set WORK.ONE.
NOTE: The data set WORK.WANT has 6 observations and 4 variables.

When you create a new variable, that variable is NOT retained, so everything works as you want.

A "rule" might be to avoid changing the values of any variables when you merge data in a one-to-many merge.  Another sign that you may be headed for trouble is when something just feels odd.  In my above example, why would I recode a patient-attribute (gender), based on a test-level (test) attribute?  In your example, why recode diagnostic code based on site id?

HTH,

--Q.

New Contributor
Posts: 4

Re: Miscoding mystery

The merging seems to work fine - I checked the individual records before and after. I also verified the results with and without the "if" statement. To address your question - there are specific histology codes that are associated with non-Hodgkin lymphoma and leukemia. The "map" dataset, therefore, provides the correct diagnosis code to each of the histology codes from the "lymphoma" dataset. However, if a record with a non-Hodgkin lymphoma histology occurs at a site related to the brain or nervous system (700<=sitecd<=729), we want those records coded specifically to "central nervous system lymphoma" (diagcd 38). This is what the if statement is trying to accomplish, but for some reason, it's also mapping diagcd 38 to other sites. I attached datasets in my original posting, which I hope will illustrate the problem. If you run:

proc freq data=lymphoma2;

     tables diagcd*sitecd/list;

run;

you'll see that there are sitecd>729 that were mapped to diagcd=38. Without the "if" statement in the data step, those records would have been mapped to diagcd=37 or 39.

PROC Star
Posts: 7,492

Re: Miscoding mystery

I think that Quentin already explain why you aren't getting what you expect. The following is one way to control the behavior that he described:

data lymphoma2 (drop=_diagcd);

     merge lymphoma1 (in=in1 keep=sitecd histocd)

           map (keep=histocd diagcd rename=(diagcd=_diagcd));   

     by histocd;

     if in1;

     if 700<=sitecd<=729 then diagcd=38;

     else diagcd=_diagcd;

run;

New Contributor
Posts: 4

Re: Miscoding mystery

Thank you, all! I understand now - this has been very helpful. And I will be sure to create new variables for this purpose in the future.

Super User
Posts: 11,343

Re: Miscoding mystery

You may want to invest time in some custom formats if you do a lot of this type of coding. Once you have the formats or informats created you can do a lot with things, sometimes without even creating new variables, for grouping variables.

For instance I have codes that use clinic level identifiers to create regions or agency information. And if one of the clinics changes I only have to change that in the format and the new behavior is available at the next report time without changing the report code.

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 8 replies
  • 326 views
  • 3 likes
  • 5 in conversation