kristiepauly
Obsidian | Level 7

Hello, 

When I logged on and ran my SAS code this morning, I noticed my number of observations had decreased by 1 (yesterday there were 116, today it is 115). I reran just the section below and the observations were 114. It decreases by one every time I run the data step. I would like to know why this is happening and how to fix it. 

Data datasets.REDCap_PDE0;
	Set datasets.redcap_pde0;
	age_first_pres_days_num= INPUT(age_first_pres_days, 4.);
	if _N_ < 2 then delete;
Run;	

Proc contents data=datasets.REDCap_PDE0;
	run;


The CONTENTS Procedure

Data Set Name	DATASETS.REDCAP_PDE0	Observations	113
Member Type	DATA	Variables	66
Engine	V9	Indexes	0
Created	08/03/2022 09:08:32	Observation Length	7336
Last Modified	08/03/2022 09:08:32	Deleted Observations	0
Protection	 	Compressed	NO
Data Set Type	 	Sorted	NO
Label	 	 	 
Data Representation	SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64	 	 
Encoding	utf-8 Unicode (UTF-8)	 	 
Engine/Host Dependent Information
Data Set Page Size	131072
Number of Data Set Pages	7
First Data Page	1
Max Obs per Page	17
Obs in First Data Page	16
Number of Data Set Repairs	0
Filename	/home/u46539590/Datasets/redcap_pde0.sas7bdat
Release Created	9.0401M6
Host Created	Linux
Inode Number	955674674
Access Permission	rw-r--r--
Owner Name	u46539590
File Size	1MB
File Size (bytes)	1048576
9 REPLIES
Tom
Super User
Data datasets.REDCap_PDE0;
	Set datasets.redcap_pde0;
	age_first_pres_days_num= INPUT(age_first_pres_days, 4.);
	if _N_ < 2 then delete;
Run;	

You are using the same data set as both the input and the output of this data step.  (The mixed-case letters in the output dataset name do not matter, as SAS names are not case sensitive.)

 

Since you have code that deletes the first observation, by definition it is going to delete one more observation every time it runs.

 

Do NOT use the same name for the input and output of a data step.  It will confuse you and make it impossible to fix mistakes, since the original input will be gone.
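A safe variant of the step above can be sketched with distinct input and output names. (The names `import` and `datasets.redcap_pde0_fixed` here are illustrative, not from the original code.)

```sas
/* Sketch only: input and output names differ, so rerunning is harmless.
   'import' and 'redcap_pde0_fixed' are hypothetical names. */
data datasets.redcap_pde0_fixed;
    set import;                               /* original imported data, left unchanged */
    age_first_pres_days_num = input(age_first_pres_days, 4.);
    if _n_ < 2 then delete;                   /* drops only obs 1 of the source, once */
run;
```

Because `import` is never overwritten, running this step ten times still produces the same output.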

ballardw
Super User

Tell me exactly what you think this code does, especially the part highlighted in red.

 

Data datasets.REDCap_PDE0;
	Set datasets.redcap_pde0;
	age_first_pres_days_num= INPUT(age_first_pres_days, 4.);
	if _N_ < 2 then delete;
Run;	

The _N_ automatic variable indicates the number of times the data step "loops". If you aren't doing something with multiple SET or similar statements, then it in effect counts the observations in the data set.
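As a small illustration (a sketch using the `sashelp.class` table that ships with SAS, not the poster's data), _N_ can be captured into a variable to see it increment once per iteration:

```sas
/* Sketch: _N_ counts data step iterations; with a single SET statement,
   that matches the observation number. */
data demo;
    set sashelp.class;
    iteration = _n_;   /* 1 on the first observation, 2 on the second, ... */
run;
```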

 

That statement says "delete the first record". So every time you rerun the data step, you lose another record.

And this is a perfect example of when NOT to use the same data set on the DATA statement and the SET statement, as you do with:

Data datasets.REDCap_PDE0;
	Set datasets.redcap_pde0;

That construct completely replaces the existing datasets.redcap_pde0 every time the data step runs.

You are usually better off creating a new data set to avoid losing data.

 

Which does bring up: WHY did you rerun that data step at all? Its only purpose, as far as I can see, is to read a text value into a numeric variable. After you have done that, you should have no need to run the data step again.

 

kristiepauly
Obsidian | Level 7

Hi @ballardw.  My understanding of the code you highlighted in red was that SAS will ignore my first "variable" row and start with the data on the second row.  I originally had the DATA and SET statements pointing to different datasets (data: want and set: import). This was working yesterday, but then, based on a comment on a different question I had, I was under the impression the DATA and SET needed to be the same in order to read correctly. Perhaps I misunderstood what was being said. 

I understand the concept of not overwriting the data; I changed DATA and SET to match out of desperation to get something to work. 

 

DATA uses the information from the data set in the SET statement and produces your newly named dataset.  Why do I see multiple data set names in SAS code? What is the purpose of that? Maybe this is what @Kurt_Bremser was saying: "Always create a sequence of datasets from step to step, so you can always go back to a correct state and do not have to rerun the whole process, from initial data import on." Every data step should be a new dataset??

e.g.

Data want_1;
    set have;
    code;
run;

Data want_2;
    set want_1;
    code;
run;

 

I reran the data step because I thought I had to run this every time I opened SAS in order for it to start reading at obs #2. 

 

Since I've overwritten the data, I need to download the original dataset again and change all the SET statements from "SET datasets.redcap_pde0" back to "SET import", correct? 

Reeza
Super User

Yes and no. You should still have the import data set. IMO your code should look something like the following for data cleaning. 

Note that I save the actual imported data, the data with the first row deleted, and then just the final clean data. These three data sets are the primary ones, with cleanData being the one used for analysis. And if I needed to create new variables for an analysis, I'd come back to this program and add my cleaning steps here, so that my analysis always used the cleanData file. 

 

*save raw imported data to library;
data datasets.rawdata;
    set datasets.import; *imported data;
    if _N_ < 2 then delete;
run;

data cleaning1;
    set datasets.rawdata;
    age_first_pres_days_num = INPUT(age_first_pres_days, 4.);
run;

data cleaning2;
    set cleaning1;
    length IQ $50.;
    if diag_iq_norm_range = " " then IQ = "Missing";
    else if diag_iq = "0" then IQ = put(diag_iq_norm_range, $diag_iq_norm_range.);
    else if diag_iq = "1" then IQ = put(diag_global_dev_delay, $diag_global_dev_delay.);
    else IQ = "Missing";
run;

data cleaning3;
    set cleaning2;
    .....other steps....;
run;

*insert other steps as needed;

*save cleaned data for analysis to library;
data datasets.cleanData;
    set cleaningN; *last data set that is clean;
run;
ballardw
Super User

@kristiepauly wrote:

Hi @ballardw.  My understanding of the code you highlighted in red was that SAS will ignore my first "variable" row and start with the data on the second row.  I originally had the DATA and SET statements pointing to different datasets (data: want and set: import). This was working yesterday, but then, based on a comment on a different question I had, I was under the impression the DATA and SET needed to be the same in order to read correctly. Perhaps I misunderstood what was being said. 

I understand the concept of not overwriting the data; I changed DATA and SET to match out of desperation to get something to work. 

 

Yep, most likely a misunderstanding, whether of the comment or of a suggestion.

Describe the "desperation to get something to work". That likely points to an issue where you should provide an example of what you have and what you want for the result. For example, the data step you were running did not ever need to be run again after you created the numeric value, unless your source data changes. You wrote the output to a permanent library. You can read that result from the permanent library any time the library is available. So there was no need to rewrite the values.

 

@Kurt_Bremser's comment about "sequential sets" is an important bit for beginners. It is extremely likely that after you have the second data step working correctly as intended, you might move the code into the previous data step to reduce the number of data steps actually needed. This is quite common when processing similar data at different times and learning how to write and modify code. A trivial example, such as yours: while developing what a project may require, I do this.

 

Data want_1;
       set have; 
       <recode variablex to RecodeVariablex values>
run;

Data want_2;
     set want_1;
     <recode variableY to RecodeVariableY values>
run;

After I have all of the above working, in preparation for next week/month/quarter/year report preparation I would likely reduce that to a single data step:

 

Data want_1;
       set have; 
       <recode variablex to RecodeVariablex values>
       <recode variableY to RecodeVariableY values>
run;

I might change the data set name to Want_2 if that was the set used for the report/analysis/graphing steps so that I didn't have to modify that code.

 

I have inherited projects that started with 8 or 10 "sequential data steps" because the person(s) developing the reports weren't experienced enough. I combined the data steps and changed about 1,000 lines of if/then/else coding to use arrays, reducing the code to less than a third of its original length.

ballardw
Super User

@kristiepauly wrote:

 

I reran the data step because I thought I had to run this every time I opened SAS in order for it to start reading at obs #2. 

 


If you want to process a data set and NOT remove the first record, but still do something only to the later records (again, from your code, this is not needed after the first run), you would do something like:

Data want; 
    set have;
    if _n_ > 1 then <do what you want on all records after the first>;
run;

The question might be: why the second record? Do you think you are avoiding the header record? If that is your reason, it is a misunderstanding of how SAS processes data. The first record is data. The header elements are separate from the data.
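A quick sketch of that point: when a data step creates data, the variable names live in the dataset's descriptor, not in observation 1. (This is hypothetical toy data, not the poster's.)

```sas
/* Sketch: 'name' and 'age' are stored as metadata in the descriptor;
   the first observation is Ann's data row, not a header. */
data have;
    input name $ age;
    datalines;
Ann 34
Bob 29
;
run;

proc print data=have;   /* prints 2 observations; no header row among them */
run;
```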

 

You do have to go back to re-importing or reading the data, unless you have another version of the complete data set somewhere else. Once this step is run and you have placed the desired values into a data set in a permanent library, you USE the data by referencing that permanent data set on either a SET statement in a data step or DATA= in procs. Do not rerun that data step unless the source data actually changes.
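For instance (a sketch reusing the permanent dataset name from this thread; the choice of procedure and the `analysis` name are illustrative):

```sas
/* Sketch: reference the stored data instead of rebuilding it. */
proc freq data=datasets.redcap_pde0;     /* DATA= in a procedure */
    tables age_first_pres_days_num;
run;

data analysis;
    set datasets.redcap_pde0;            /* SET in a new data step */
    /* further derivations here */
run;
```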

 

 

Reeza
Super User
Data datasets.REDCap_PDE0;
	Set datasets.redcap_pde0;
	age_first_pres_days_num= INPUT(age_first_pres_days, 4.);
	if _N_ < 2 then delete;
Run;	

Your input and output data set have the same name. This is not recommended. 

And yes, each time you run this you'd lose one record because you're always removing the first record.

Make sure your input and output data set names are different.

 

data datasets.master;
set import;
run;


Data datasets.REDCap_PDE0;
	Set datasets.master;
	age_first_pres_days_num= INPUT(age_first_pres_days, 4.);
	if _N_ < 2 then delete;
Run;	
Kurt_Bremser
Super User

The classic example of why directly overwriting a dataset in a step MUST NOT BE DONE. Always create a sequence of datasets from step to step, so you can always go back to a correct state and do not have to rerun the whole process from the initial data import on.

Tom
Super User

If this is related to your earlier thread, where it looked like you were creating data from an XLSX file that had variable names in the first row and variable labels (descriptions) in the second row, then just fix the PROC IMPORT step and you won't need to delete the extra observation caused by the extra row.

proc import dbms=xlsx datafile='myfile.xlsx' out=want replace;
   datarow=3;
run;

If instead you are creating the dataset by reading a CSV file (whether or not it was generated by EXCEL) that also had an extra row, then skip PROC IMPORT and instead write your own data step to read the CSV file and tell it to start from the third line instead of the second line.

data want;
  infile 'myfile.csv' dsd truncover firstobs=3;
  ... rest of data step to read the file ...
run;

