BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Kouros
Calcite | Level 5

Hi,

 

Any help would be greatly appreciated.

 

I have a big file (>3.5 mil rows) describing weather. The data is coming as a single variable and it is a succession of filled rows with text broken by an empty row when weather for a new location is being described. There is no set number of rows for a particular block of data. There are also blocks of information not necessarily related to a particular location. The number of rows in a particular block can go from 1 to couple of 10s of rows. I would like to assign a unique identifier to each block and to then identify the number of rows in each block. It looks like this:

 

DATA
WHISTLER.
TODAY..MAINLY CLOUDY. CHANCE OF SHOWERS. HIGH 19. UV INDEX 4.8 OR
MODERATE.
TONIGHT..CLOUDY WITH SHOWERS LIKELY. LOW 10.
FRIDAY..CLOUDY WITH SUNNY PERIODS. CHANCE OF A SHOWER. HIGH 24.
PROBABILITY OF PRECIPITATION IN PERCENT 60 TODAY. 70 TONIGHT AND
40 FRIDAY.

 

SUNSHINE COAST.
TODAY..CLOUDY WITH SHOWERS LIKELY. HIGH 18. UV INDEX 4.6 OR MODERAT
TONIGHT..CLOUDY WITH SHOWERS LIKELY. LOW 12.
FRIDAY..SUNNY WITH CLOUDY PERIODS. CHANCE OF A SHOWER. HIGH 22.
PROBABILITY OF PRECIPITATION IN PERCENT 70 TODAY. 70 TONIGHT AND
40 FRIDAY.

 

EAST VANCOUVER ISLAND.
TODAY..MAINLY CLOUDY. SHOWERS LIKELY. WINDY NEAR THE STRAIT. HIGH 1
UV INDEX 4.8 OR MODERATE.
TONIGHT..CLOUDY WITH SHOWERS LIKELY. WINDY NEAR THE STRAIT. LOW 12.
FRIDAY..SUNNY WITH CLOUDY PERIODS. CHANCE OF A SHOWER NORTHERN
SECTIONS. HIGH 21 TO 24.
PROBABILITY OF PRECIPITATION IN PERCENT 70 TODAY. 70 TONIGHT AND
40 FRIDAY.

 

WEST VANCOUVER ISLAND - COASTAL SECTIONS.
TODAY..CLOUDY WITH SHOWERS. WIND SOUTHEAST 40 TO 60 KM/H OVER
NORTHERN SECTIONS. HIGH 16. UV INDEX 4.8 OR MODERATE.
TONIGHT..CLOUDY WITH SHOWERS. WIND SOUTHEAST 40 TO 60 KM/H OVER
NORTHERN SECTIONS EASING THIS EVENING. LOW 12.
FRIDAY..CLOUDY WITH SUNNY PERIODS. CHANCE OF A MORNING SHOWER.
HIGH 18.
PROBABILITY OF PRECIPITATION IN PERCENT 100 TODAY. 90 TONIGHT AND
40 FRIDAY.

 

WEST VANCOUVER ISLAND - INLAND SECTIONS.
TODAY..CLOUDY WITH A FEW SHOWERS. HIGH 18. UV INDEX 4.7 OR MODERATE
TONIGHT..CLOUDY WITH A FEW SHOWERS. LOW 12.
FRIDAY..CLOUDY WITH SUNNY PERIODS. CHANCE OF A SHOWER MAINLY NORTHE
SECTIONS. HIGH 24.
PROBABILITY OF PRECIPITATION IN PERCENT 80 TODAY. 80 TONIGHT AND
40 FRIDAY.

 

NORTH VANCOUVER ISLAND.
TODAY..CLOUDY WITH SHOWERS. WIND SOUTHEAST 40 TO 60 KM/H. HIGH 16.
UV INDEX 4.5 OR MODERATE.
TONIGHT..CLOUDY WITH SHOWERS. SOUTHEAST 40 TO 60 KM/H EASING THIS
EVENING. LOW 11.
FRIDAY..CLOUDY WITH SUNNY PERIODS. CHANCE OF SHOWERS. HIGH 17.
PROBABILITY OF PRECIPITATION IN PERCENT 90 TODAY. 90 TONIGHT AND
60 FRIDAY.

 


 

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

You must already have a variable named BULLETIN and it is a CHARACTER variable.

To use the SUM statement (really to use the RETAIN feature that it implies) you need to sum into a NEW NUMERIC variable.

data have;
  input text $20.;
cards;
asdfs
asdfasd

sdfas
adsfa
adsfas
;

data have;
  set have;
  group + missing(lag(text));
run;

Tom_0-1677865506657.png

 

 

View solution in original post

8 REPLIES 8
Tom
Super User Tom
Super User

It is not clear from your data dump what is the name of the dataset or the name of the variable with the text.

So lets just assume the dataset is named HAVE and the variable is named TEXT.

data want;
  set have;
  group + missing(text);
run;

That will put number the groups starting with 0. The blank line will be the beginning of a new group.

Kouros
Calcite | Level 5

It doesn't work.

 

However, I found a solution.

 

First, I identify the blank row because with the Len() function will give me the value of 1.

 

Then I use the lag1 function for this newly created variable to identify the start of a new block of text.

 

Knowing the beginning of the new block, define a new variable ID = _N_ where lag1(Len()) = 1.

 

Then using retain statement assign the same _N_ value to the rest of each block's rows.

 

Identifying the number of rows in each block then becomes easy,

Tom
Super User Tom
Super User

The LENGTH() function will return 1 for an empty value, but also for a value with only one character.  Note that it might be that the empty values are not actually empty, but instead of a single non-blank character. perhaps a Carriage Return ('0D'x) or and LineFeed ('0A'x) or some other invisible character.

 

The LENGTHN() function will return 0 for an empty value.

 

You seem to have split the simple SUM STATEMENT used in my example back into its pieces.  

Try testing if the previous line only has spaces, cr, lf, tab or non-breaking spaces to indicate the start of a new group.

data want;
  set have;
  group + missing(compress(lag(TEXT),'0D0A0900A0'x));
run;

That will start the numbering at the line AFTER the blank line.  Since the first call to LAG() always returns empty value since nothing has yet been given to it to remember it also means it will start numbering from 1 instead of 0.

Kouros
Calcite | Level 5

Thank you.

 

This is what I get:

 

179 data want;
180 set B2016_21;
181 Bulletin + missing(compress(lag(TEXT),'0D0A0900A0'x));
-
400
ERROR 400-185: The SUM statement requires numeric expression.

182 run;

NOTE: Numeric values have been converted to character values at the places given by:
(Line):(Column).
181:31
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.WANT may be incomplete. When this step was stopped there were 0
observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds


183 data want;
184 set B2016_21;
185 Bulletin + missing(compress(lag(Bulletin),'0D0A0900A0'x));
-
400
ERROR 400-185: The SUM statement requires numeric expression.

186 run;

NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.WANT may be incomplete. When this step was stopped there were 0
observations and 1 variables.
WARNING: Data set WORK.WANT was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds

Tom
Super User Tom
Super User

You must already have a variable named BULLETIN and it is a CHARACTER variable.

To use the SUM statement (really to use the RETAIN feature that it implies) you need to sum into a NEW NUMERIC variable.

data have;
  input text $20.;
cards;
asdfs
asdfasd

sdfas
adsfa
adsfas
;

data have;
  set have;
  group + missing(lag(text));
run;

Tom_0-1677865506657.png

 

 

Kouros
Calcite | Level 5
Thank you Tim,

Yes, hat worked perfectly.

What is this group statement? My SAS version, it is not colored as a special word, but it is kept as text, and I cannot find it in the SAS Help.
Tom
Super User Tom
Super User

GROUP is just the name of the variable.

 

That is a SUM statement.

https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/lestmtsref/n1dfiqj146yi2cn1maeju9wo7ijs.htm

 

Then syntax is VARIABLE + EXPRESSION ;

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 724 views
  • 1 like
  • 2 in conversation