Hi,
Any help would be greatly appreciated.
I have a big file (>3.5 mil rows) describing weather. The data is coming as a single variable and it is a succession of filled rows with text broken by an empty row when weather for a new location is being described. There is no set number of rows for a particular block of data. There are also blocks of information not necessarily related to a particular location. The number of rows in a particular block can go from 1 to couple of 10s of rows. I would like to assign a unique identifier to each block and to then identify the number of rows in each block. It looks like this:
DATA
WHISTLER.
TODAY..MAINLY CLOUDY. CHANCE OF SHOWERS. HIGH 19. UV INDEX 4.8 OR
MODERATE.
TONIGHT..CLOUDY WITH SHOWERS LIKELY. LOW 10.
FRIDAY..CLOUDY WITH SUNNY PERIODS. CHANCE OF A SHOWER. HIGH 24.
PROBABILITY OF PRECIPITATION IN PERCENT 60 TODAY. 70 TONIGHT AND
40 FRIDAY.
SUNSHINE COAST.
TODAY..CLOUDY WITH SHOWERS LIKELY. HIGH 18. UV INDEX 4.6 OR MODERAT
TONIGHT..CLOUDY WITH SHOWERS LIKELY. LOW 12.
FRIDAY..SUNNY WITH CLOUDY PERIODS. CHANCE OF A SHOWER. HIGH 22.
PROBABILITY OF PRECIPITATION IN PERCENT 70 TODAY. 70 TONIGHT AND
40 FRIDAY.
EAST VANCOUVER ISLAND.
TODAY..MAINLY CLOUDY. SHOWERS LIKELY. WINDY NEAR THE STRAIT. HIGH 1
UV INDEX 4.8 OR MODERATE.
TONIGHT..CLOUDY WITH SHOWERS LIKELY. WINDY NEAR THE STRAIT. LOW 12.
FRIDAY..SUNNY WITH CLOUDY PERIODS. CHANCE OF A SHOWER NORTHERN
SECTIONS. HIGH 21 TO 24.
PROBABILITY OF PRECIPITATION IN PERCENT 70 TODAY. 70 TONIGHT AND
40 FRIDAY.
WEST VANCOUVER ISLAND - COASTAL SECTIONS.
TODAY..CLOUDY WITH SHOWERS. WIND SOUTHEAST 40 TO 60 KM/H OVER
NORTHERN SECTIONS. HIGH 16. UV INDEX 4.8 OR MODERATE.
TONIGHT..CLOUDY WITH SHOWERS. WIND SOUTHEAST 40 TO 60 KM/H OVER
NORTHERN SECTIONS EASING THIS EVENING. LOW 12.
FRIDAY..CLOUDY WITH SUNNY PERIODS. CHANCE OF A MORNING SHOWER.
HIGH 18.
PROBABILITY OF PRECIPITATION IN PERCENT 100 TODAY. 90 TONIGHT AND
40 FRIDAY.
WEST VANCOUVER ISLAND - INLAND SECTIONS.
TODAY..CLOUDY WITH A FEW SHOWERS. HIGH 18. UV INDEX 4.7 OR MODERATE
TONIGHT..CLOUDY WITH A FEW SHOWERS. LOW 12.
FRIDAY..CLOUDY WITH SUNNY PERIODS. CHANCE OF A SHOWER MAINLY NORTHE
SECTIONS. HIGH 24.
PROBABILITY OF PRECIPITATION IN PERCENT 80 TODAY. 80 TONIGHT AND
40 FRIDAY.
NORTH VANCOUVER ISLAND.
TODAY..CLOUDY WITH SHOWERS. WIND SOUTHEAST 40 TO 60 KM/H. HIGH 16.
UV INDEX 4.5 OR MODERATE.
TONIGHT..CLOUDY WITH SHOWERS. SOUTHEAST 40 TO 60 KM/H EASING THIS
EVENING. LOW 11.
FRIDAY..CLOUDY WITH SUNNY PERIODS. CHANCE OF SHOWERS. HIGH 17.
PROBABILITY OF PRECIPITATION IN PERCENT 90 TODAY. 90 TONIGHT AND
60 FRIDAY.
You must already have a variable named BULLETIN and it is a CHARACTER variable.
To use the SUM statement (really to use the RETAIN feature that it implies) you need to sum into a NEW NUMERIC variable.
data have;
input text $20.;
cards;
asdfs
asdfasd
sdfas
adsfa
adsfas
;
data have;
set have;
group + missing(lag(text));
run;
It is not clear from your data dump what is the name of the dataset or the name of the variable with the text.
So lets just assume the dataset is named HAVE and the variable is named TEXT.
data want;
set have;
group + missing(text);
run;
That will put number the groups starting with 0. The blank line will be the beginning of a new group.
It doesn't work.
However, I found a solution.
First, I identify the blank row because with the Len() function will give me the value of 1.
Then I use the lag1 function for this newly created variable to identify the start of a new block of text.
Knowing the beginning of the new block, define a new variable ID = _N_ where lag1(Len()) = 1.
Then using retain statement assign the same _N_ value to the rest of each block's rows.
Identifying the number of rows in each block then becomes easy,
The LENGTH() function will return 1 for an empty value, but also for a value with only one character. Note that it might be that the empty values are not actually empty, but instead of a single non-blank character. perhaps a Carriage Return ('0D'x) or and LineFeed ('0A'x) or some other invisible character.
The LENGTHN() function will return 0 for an empty value.
You seem to have split the simple SUM STATEMENT used in my example back into its pieces.
Try testing if the previous line only has spaces, cr, lf, tab or non-breaking spaces to indicate the start of a new group.
data want;
set have;
group + missing(compress(lag(TEXT),'0D0A0900A0'x));
run;
That will start the numbering at the line AFTER the blank line. Since the first call to LAG() always returns empty value since nothing has yet been given to it to remember it also means it will start numbering from 1 instead of 0.
Thank you.
This is what I get:
179 data want;
180 set B2016_21;
181 Bulletin + missing(compress(lag(TEXT),'0D0A0900A0'x));
-
400
ERROR 400-185: The SUM statement requires numeric expression.
182 run;
NOTE: Numeric values have been converted to character values at the places given by:
(Line):(Column).
181:31
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.WANT may be incomplete. When this step was stopped there were 0
observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds
183 data want;
184 set B2016_21;
185 Bulletin + missing(compress(lag(Bulletin),'0D0A0900A0'x));
-
400
ERROR 400-185: The SUM statement requires numeric expression.
186 run;
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.WANT may be incomplete. When this step was stopped there were 0
observations and 1 variables.
WARNING: Data set WORK.WANT was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
You must already have a variable named BULLETIN and it is a CHARACTER variable.
To use the SUM statement (really to use the RETAIN feature that it implies) you need to sum into a NEW NUMERIC variable.
data have;
input text $20.;
cards;
asdfs
asdfasd
sdfas
adsfa
adsfas
;
data have;
set have;
group + missing(lag(text));
run;
GROUP is just the name of the variable.
That is a SUM statement.
https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/lestmtsref/n1dfiqj146yi2cn1maeju9wo7ijs.htm
Then syntax is VARIABLE + EXPRESSION ;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.