SEQUENCE - Letter Sequence Analysis

Cryptographic analysis makes extensive use of the frequency with which letters and letter sequences occur in a language. If an encrypted text is known to be in English, for example, a great deal can be learned from the fact that the letters E, L, N, R, S, and T are the most common ones used in written English. Even more can be learned if common letter pairs, triplets, etc. are known.

For this problem you are to write a program which accepts as input a text file of unspecified length and performs letter sequence analysis on the text. The program will report the five most frequent letter sequences for each set of sequences from one to five letters. That is, it will report the individual characters which occur with the five highest frequencies, the pairs of characters which occur with the five highest frequencies, and so on up to the letter sequences of five characters which occur with the five highest frequencies.

The program should consider contiguous sequences of alphabetic characters only, and case should be ignored (e.g. an 'a' is the same as an 'A'). A report should be produced using the format shown in the example at the end of this problem description. For each sequence length from one to five, the report should list the sequences in descending order of frequency. If there are several sequences with the same frequency, then all sequences should be listed in alphabetical order as shown (list all sequences in upper case). Finally, if there are less than five distinct frequencies for a particular sequence length, simply report as many distinct frequency lists as possible.
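
The counting rules above can be sketched in Python as follows. This is only one possible reading, not a reference solution: it assumes that a sequence never spans a non-alphabetic character (which matches the first example, where no "RP" pair is reported), and the names `analyze`, `max_len` and `top` are made up for illustration.

```python
import re
from collections import Counter


def analyze(text, max_len=5, top=5):
    """Return {n: [(frequency, [sequences]), ...]} for n = 1..max_len.

    Assumption: sequences are taken inside maximal runs of ASCII
    letters only, so they never cross spaces or punctuation.
    """
    # Split the text into runs of letters and fold them to upper case.
    words = [w.upper() for w in re.findall(r"[A-Za-z]+", text)]
    report = {}
    for n in range(1, max_len + 1):
        # Count every window of n letters inside every run.
        counts = Counter(
            word[i:i + n] for word in words for i in range(len(word) - n + 1)
        )
        # Group the sequences by their frequency.
        by_freq = {}
        for seq, freq in counts.items():
            by_freq.setdefault(freq, []).append(seq)
        # Keep the `top` highest distinct frequencies, sequences sorted A-Z.
        report[n] = [
            (freq, sorted(by_freq[freq]))
            for freq in sorted(by_freq, reverse=True)[:top]
        ]
    return report
```

Grouping by frequency before truncating is what makes ties stay together: the limit of five applies to distinct frequencies, not to individual sequences.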

When a text file containing simply the line "Peter Piper Picks Pickles!" is used as input, the output should appear as shown here:

Analysis for Letter Sequences of Length 1
-----------------------------------------
Frequency = 5, Sequence(s) = (P)
Frequency = 4, Sequence(s) = (E)
Frequency = 3, Sequence(s) = (I)
Frequency = 2, Sequence(s) = (C,K,R,S)
Frequency = 1, Sequence(s) = (L,T)

Analysis for Letter Sequences of Length 2
-----------------------------------------
Frequency = 3, Sequence(s) = (PI)
Frequency = 2, Sequence(s) = (CK,ER,IC,PE)
Frequency = 1, Sequence(s) = (ES,ET,IP,KL,KS,LE,TE)

Analysis for Letter Sequences of Length 3
-----------------------------------------
Frequency = 2, Sequence(s) = (ICK,PIC)
Frequency = 1, Sequence(s) = (CKL,CKS,ETE,IPE,KLE,LES,PER,PET,PIP,TER)

Analysis for Letter Sequences of Length 4
-----------------------------------------
Frequency = 2, Sequence(s) = (PICK)
Frequency = 1, Sequence(s) = (CKLE,ETER,ICKL,ICKS,IPER,KLES,PETE,PIPE)

Analysis for Letter Sequences of Length 5
-----------------------------------------
Frequency = 1, Sequence(s) = (CKLES,ICKLE,PETER,PICKL,PICKS,PIPER)

When the first three paragraphs of this problem description are used as input, the output should appear as shown here:

Analysis for Letter Sequences of Length 1
-----------------------------------------
Frequency = 201, Sequence(s) = (E)
Frequency = 112, Sequence(s) = (T)
Frequency = 96, Sequence(s) = (S)
Frequency = 90, Sequence(s) = (R)
Frequency = 84, Sequence(s) = (N)

Analysis for Letter Sequences of Length 2
-----------------------------------------
Frequency = 37, Sequence(s) = (TH)
Frequency = 33, Sequence(s) = (EN)
Frequency = 27, Sequence(s) = (HE)
Frequency = 24, Sequence(s) = (RE)
Frequency = 23, Sequence(s) = (NC)

Analysis for Letter Sequences of Length 3
-----------------------------------------
Frequency = 24, Sequence(s) = (THE)
Frequency = 21, Sequence(s) = (ENC,EQU,QUE,UEN)
Frequency = 12, Sequence(s) = (NCE,SEQ,TER)
Frequency = 9, Sequence(s) = (CES,FRE,IVE,LET,REQ,TTE)
Frequency = 8, Sequence(s) = (ETT,FIV)

Analysis for Letter Sequences of Length 4
-----------------------------------------
Frequency = 21, Sequence(s) = (EQUE,QUEN)
Frequency = 20, Sequence(s) = (UENC)
Frequency = 12, Sequence(s) = (ENCE,SEQU)
Frequency = 9, Sequence(s) = (FREQ,NCES,REQU)
Frequency = 8, Sequence(s) = (ETTE,FIVE,LETT,TTER)

Analysis for Letter Sequences of Length 5
-----------------------------------------
Frequency = 21, Sequence(s) = (EQUEN)
Frequency = 20, Sequence(s) = (QUENC)
Frequency = 12, Sequence(s) = (SEQUE,UENCE)
Frequency = 9, Sequence(s) = (ENCES,FREQU,REQUE)
Frequency = 8, Sequence(s) = (ETTER,LETTE)
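
The report layout used in the examples above could be produced along these lines. This is a sketch under assumptions: the dashed rule is taken to be as long as the header line, each block is taken to end with one blank line, and the function name `format_report` and its input shape (a dict mapping each length to a list of `(frequency, sequences)` pairs) are hypothetical.

```python
def format_report(report):
    """Render {n: [(frequency, [sequences]), ...]} in the sample layout."""
    lines = []
    for n in sorted(report):
        header = f"Analysis for Letter Sequences of Length {n}"
        lines.append(header)
        # Assumption: the rule under the header has the header's length.
        lines.append("-" * len(header))
        for freq, seqs in report[n]:
            lines.append(f"Frequency = {freq}, Sequence(s) = ({','.join(seqs)})")
        # Assumption: one blank line closes each block.
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_report({4: [(2, ["PICK"]), (1, ["CKLE", "ETER"])]}), end="")
```

Whether the judge wants a trailing blank line after the last block is not stated in the problem, so that detail may need adjusting.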

nadstratosfer: 2017-10-16 20:18:38

Kudos to the setter for taking care of both judge settings and testfile formatting, so no nasty WA's like with so many string problems on SPOJ. Also relaxed time limit so can focus on the design without obfuscating the code with speed tricks. Well designed, useful problem that could otherwise be a nightmare. 1.2K of Python code looks quite elegant considering it wasn't so easy to conceive it, probably a tad masochistic to do it in C. Not something for the "easy AC in 1 go move to tutorial" infants anyway.

TL;DR: Fun problem if you have patience for tedious, real-life coding.

The Bartender: 2011-03-22 01:13:11

Ok

Last edit: 2011-06-18 09:05:52
:D: 2010-07-18 22:53:20

You must read till the End Of File (EOF). Detecting it depends on the language and reading method.
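
In Python, for instance, reading to EOF can be as simple as the sketch below. The helper name `read_all` is made up for illustration; it just wraps `read()`, which returns everything up to end of file.

```python
import sys


def read_all(stream=None):
    """Return the whole input as one string, reading until EOF."""
    # Default to standard input; a file-like object can be passed for testing.
    stream = sys.stdin if stream is None else stream
    return stream.read()


if __name__ == "__main__":
    text = read_all()
    print(len(text))  # number of characters read
```

Reading the whole input at once keeps the analysis independent of how the text is split into lines.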

Thomas Price: 2009-10-16 01:20:39

Does anyone know how to read the input for this problem? It doesn't tell you how to know when the input ends.


Added by: Wanderley Guimarães
Date: 2006-06-01
Time limit: 0.870s
Source limit: 50000B
Memory limit: 1536MB
Cluster: Cube (Intel G860)
Languages: All except: ERL JS-RHINO NODEJS PERL6 VB.NET
Resource: ACM Mid Central Regionals 1995