PK11F - Spelling Suggestion

no tags 

A spelling suggestion is a part of spelling correction program that generates a set of plausible replacements for words that are likely to be misspelled. One way to measure the plausibility of these replacements is to compute their edit distance against a given misspelled word. The edit distance between two words is the total number of edit operations that have to be done in order to transform one word into the other. Normally these edit operations are insertion, deletion and substitution of a single character including transposition of 2 consecutive characters.

For example, for a word “wonder”, if the deletion is applied at the character 'o', this word will transform into “wnder”. And if the substitution with 'a' is applied at 'o', it will become “wander”. And if the transposition is applied at “er”, it will become “wondre”.

In this edit distance strategy, the degree of similarity between two words is up to their minimum edit distance. If the minimum edit distance between word1 and word2 is lower than the distance between word1 and word3, then word1 is more similar to word2 than to word3. So the word2 is a better spelling suggestion for word1, comparing with word3.

Suppose that you are an employee of a software company which needs to build up a prototype of spelling suggestion program. This prototype tends to be a part of word processing software. The requirement is that it has to use the edit distance strategy for their spelling suggestion. But the substitution operation has to be redefined to match the behavior of mistyping. The cost of substitution of a character with another character depends on the position of them on the keyboard layout. If they are close to each other, the cost is only half of the normal one. For this purpose, the substitution is categorized into near-substitution and far-substitution. Their costs are defined as 1 and 2 respectively. And the costs for insertion, deletion and transposition are 2. In addition, this program must run fast enough to pass the time limit that is set by your manager. By the way for this prototype, an English QWERTY keyboard layout is chosen to be used.

Goal

To generate optimum spelling suggestions for each input word, where each optimum spelling suggestion is a word in dictionary that has the least minimum edit distance from the given input word. The time limit for 5,000 misspelled words is less than or equal to 5 minutes.

Input

Input is a standard input which contains 3 parts of data. Each of these three parts ends with a blank line.

  • The first part is a set of near-substitution rules, where each rule is kept in one line. Each line has two fields. Each field is separated by a space. The total number of rules is less than or equal to 150.
    • The first field is a character where it can be near-substituted with other characters.

    • The second field is a sequence of characters which can be near-substituted for the character in the previous field. There is no space in this field.

    • The characters that may be contained in this part are characters that can be typed in using a generic English keyboard layout, which are alpha-numeric characters and some punctuations without space or tab. They are listed as the following.

      abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789`~!@#$%^&*()-_=+\|[{]};:'”,<.>/?
  • The second part is a sequence of words in dictionary, where each word is kept in one line. The total number of words is less than or equal to 150,000. The characters in dictionary are alphabetical characters with an apostrophe punctuation: abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ’
  • The third part is a sequence of words that need to be checked for their spellings. Each of these words is also kept in one line. The total number of words is less than or equal to 5,000. The characters that may be contained in this part are the same as characters in the first part, which are alpha-numeric characters and some punctuation. (See the first part.)
  • Since each of these three parts end with a blank line, the third blank line is the termination of the input.

Output

For each word in the third part, write a line which contains 3 parts of information, separated with a colon.

  • The first part is the given input word.
  • The second part is the minimum edit distance between the input word and suggestion word(s).
  • The third part is an ascending sorted sequence of suggestion word(s), separated with a space. There is no space left after the last word.

Example

Line numberSample Input
1a AqQsSzZ
2b BgGvVnN
3p P0);:oO[{
4r R4$fFeEtT
5z ZaAxX
6
7a
8A
9b
10B
11Z
12angel
13angle
14anger
15angry
16ABC
17
18x
19s
20z
21xx
22xxx
23angre
24angri
25angrt
26angel
27ACB
28BAC
29CAB
30
Line numberSample Output
1x:2:A B Z a b
2s:1:a
3z:1:A Z a
4xx:4:A B Z a b
5xxx:6:A ABC B Z a b
6angre:2:anger angle angry
7angri:2:angry
8angrt:2:anger angry
9angel:0:angel
10ACB:2:ABC
11BAC:2:ABC
12CAB:4:A ABC B

More Explanations

In this sample input, there are 5 near-substitution rules (line number 1-5), 10 words in dictionary (line number 7-16) and 12 words looking for their suggestions (line number 18-29).

In the sample output, there are 12 lines for each corresponding 12 words from the input.

For the 1st word, the minimum edit distance between x and its suggestions (A B Z a b) is 2. All of them are the (far) substitution costs.

For the 2nd word, the minimum edit distance between s and its suggestion (a) is 1, which is a near-substitution cost, guided by the 1st substitution rule.

For the 3rd word, the minimum edit distance between z and its suggestion (A Z a) are 1, which is a near-substitution cost, guided by the 1st or 5thsubstitution rule.

For the 4th word, the minimum edit distance is 4, which are summed from one far-substitution cost and one deletion cost.

For the 5th word, the minimum edit distance is 6. The costs between xxx and (A B Z a b) are from two deletion and one far-substitution costs.

The cost between xxx and ABC re from three far-substitution costs.

For the 6th word, the cost between angre and anger is from one transposition costs. The costs between angre and (angle angry) are from one far-substitution costs.

For the 7th word and 8th word, seem to be similar. But the cost of 7th word is from one far-substitution costs.

But the cost between angrt and anger is from 2 near-substitution costs (r substitutes with e and t substitutes with r) according to the near-substitution rule number 4.

For the 9th word, the word is exactly matched within dictionary. So the cost is 0.

For the 10th and 11th words, each cost is from one transposition costs.

For the 12th words, the cost between CAB and (A B) is from two deletion costs. The cost between CAB and (ABC) is from one deletion and one insertion costs.

Please be notify that it is not from 2 transpositions of CAB to ACB and then ACB to ABC.


hide comments
[Rampage] Blue.Mary: 2023-12-12 09:22:58

There is no blank line at the end of the whole file. This causes me several TLE. The length of all the input string is less than 30, checked by assertion.

Last edit: 2023-12-12 09:24:43
Oleg: 2023-12-12 07:19:10

Hint: Lengths are short (asserted that less than 100 characters).
Naive approach of checking every word in dictionary passes.

:D: 2012-11-21 21:48:23

Description misses one extremely vital constraint. What are the limits for word lengths!?


Added by:Race with time
Date:2012-07-25
Time limit:12.88s
Source limit:50000B
Memory limit:1536MB
Cluster: Cube (Intel G860)
Languages:All except: ASM64
Resource:ACM ICPC Regional Phuket 2011