女朋友找工作时遇到的一道编程题

文本处理是非常常见但又非常重要的任务。其中操作纷繁复杂。而今天我们的目标就是制作一个小型的文本预处理器。其主要功能就是对文本进行预处理以便后续进行固定宽度的排版。为了方便说明，我们定义如下的概念：

● 空白字符（white space）：指空格 ' '。
● 文本字符（character）：指大写或者小写的英文字母。
● 节（segment）：指一串（大于或者等于一个）连续的空白字符或者文本字符。

我们的文本处理器的功能还很朴素，无法处理除空白字符和文本字符之外的字符。我们将会使用一个固定长度作为宽度进行排版。大于这个宽度的字符会被折行。而折行不会显示任何连字符（例如 “-”），也无需对空白字符进行额外处理。例如，假设我们规定最大宽度为30，则 "The main theme of education in engineering school is learning toteach yourself" 将排版为：

The main theme of education in
 engineering school is learnin
g to teach yourself

现在请书写一个函数，该函数的输入为两个参数：

● 需要处理的文本
● 排版宽度。

该函数的返回值为预处理后的文本。预处理后的文本为每一节，及其所在的行号。中间以分号隔开。若一个节跨越了多行，则行号用逗号隔开，并从小到大进行排列。例如，假设输入为：The main theme of education in engineering school is learning to teach yourself，并且排版宽度指定为 30，则返回：

The(1); (1);main(1); (1);theme(1); (1);of(1); (1);education(1); (1);in(1); (2);engineering(2); (2);school(2); (2);is(2); (2);learning(2,3); (3);to(3); (3);teach(3); (3);yourself(3);

又例如，假设输入为："So many whitespaces"，而排版宽度为 10，则返回：

So(1); (1);many(1); (1);whitespaces(2,3);

Python demo

首先将句子拆分为，单次和空格的数组

In [1]:

strr = "The main theme of education in engineering    \
school is learning to teach yourself"

storage = []
buff_str = ""

# 字符串分割为数组
for i in range(len(strr)):
    if not i:
        flag = (strr[i] == " ")
        buff_str += strr[i]
    else:
        if (strr[i] == " ") == flag:
            buff_str += strr[i]
        else:
            storage.append(buff_str)
            buff_str = strr[i]
            flag = (strr[i] == " ")
storage.append(buff_str)

In [2]:

storage

Out[2]:

['The',
 ' ',
 'main',
 ' ',
 'theme',
 ' ',
 'of',
 ' ',
 'education',
 ' ',
 'in',
 ' ',
 'engineering',
 '    ',
 'school',
 ' ',
 'is',
 ' ',
 'learning',
 ' ',
 'to',
 ' ',
 'teach',
 ' ',
 'yourself']

对buff区计数，以记录每个节的行号

In [3]:

# 判断行号
storage_new = [0]*len(storage)

for i in range(len(storage)):
    storage_new[i] = []

buff_count = 0
line_length = 30

cnt = 1
for index, element in enumerate(storage):
    buff_count += len(element)
    if buff_count < line_length:
        storage_new[index].append(cnt)
    elif buff_count == line_length:
        buff_count = 0
        storage_new[index].append(cnt)
        cnt += 1
    else:
        x, y = divmod(buff_count, line_length)
        for i in range(x+1):
            storage_new[index].append(cnt+i)
        buff_count = y
        cnt += x

In [4]:

storage_new

Out[4]:

[[1],
 [1],
 [1],
 [1],
 [1],
 [1],
 [1],
 [1],
 [1],
 [1],
 [1],
 [2],
 [2],
 [2],
 [2],
 [2],
 [2],
 [2],
 [2, 3],
 [3],
 [3],
 [3],
 [3],
 [3],
 [3]]

最后输出

In [5]:

# 输出
for index, element in enumerate(storage):
    print(element+"(", end="")
    for i in range(len(storage_new[index])-1):
        print(str(storage_new[index][i])+",", end="")
    print(str(storage_new[index][-1])+");", end="")

The(1); (1);main(1); (1);theme(1); (1);of(1); (1);education(1); (1);in(1); (2);engineering(2);    (2);school(2); (2);is(2); (2);learning(2,3); (3);to(3); (3);teach(3); (3);yourself(3);

一道编程题

Published

Category

Tags

Contact