女朋友找工作时遇到的一道编程题
文本处理是非常常见但又非常重要的任务。其中操作纷繁复杂。而今天我们的目标就是制作一个小型的文本预处理器。其主要功能就是对文本进行预处理以便后续进行固定宽度的排版。为了方便说明,我们定义如下的概念:
- ● 空白字符(white space):指空格 ' '。
- ● 文本字符(character):指大写或者小写的英文字母。
- ● 节(segment):指一串(大于或者等于一个)连续的空白字符或者文本字符。
我们的文本处理器的功能还很朴素,无法处理除 空白字符 和 文本字符 之外的字符。我们将会使用一个固定长度作为宽度进行排版。大于这个宽度的字符会被折行。而折行不会显示任何连字符(例如 “-”),也无需对 空白字符 进行额外处理。例如,假设我们规定最大宽度为30,则 "The main theme of education in engineering school is learning toteach yourself" 将排版为:
The main theme of education in
engineering school is learnin
g to teach yourself
现在请书写一个函数,该函数的输入为两个参数:
- ● 需要处理的文本
- ● 排版宽度。
该函数的返回值为预处理后的文本。预处理后的文本为每一节,及其所在的行号。中间以分号隔开。若一个节跨越了多行,则行号用逗号隔开,并从小到大进行排列。例如,假设输入为:The main theme of education in engineering school is learning to teach yourself,并且排版宽度指定为 30,则返回:
The(1); (1);main(1); (1);theme(1); (1);of(1); (1);education(1); (1);in(1); (2);engineering(2); (2);school(2); (2);is(2); (2);learning(2,3); (3);to(3); (3);teach(3); (3);yourself(3);
又例如,假设输入为:"So many whitespaces",而排版宽度为 10,则返回:
So(1); (1);many(1); (1);whitespaces(2,3);
Python demo
首先将句子拆分为,单次和空格的数组
strr = "The main theme of education in engineering \
school is learning to teach yourself"
storage = []
buff_str = ""
# 字符串分割为数组
for i in range(len(strr)):
if not i:
flag = (strr[i] == " ")
buff_str += strr[i]
else:
if (strr[i] == " ") == flag:
buff_str += strr[i]
else:
storage.append(buff_str)
buff_str = strr[i]
flag = (strr[i] == " ")
storage.append(buff_str)
storage
对buff区计数,以记录每个节的行号
# 判断行号
storage_new = [0]*len(storage)
for i in range(len(storage)):
storage_new[i] = []
buff_count = 0
line_length = 30
cnt = 1
for index, element in enumerate(storage):
buff_count += len(element)
if buff_count < line_length:
storage_new[index].append(cnt)
elif buff_count == line_length:
buff_count = 0
storage_new[index].append(cnt)
cnt += 1
else:
x, y = divmod(buff_count, line_length)
for i in range(x+1):
storage_new[index].append(cnt+i)
buff_count = y
cnt += x
storage_new
最后输出
# 输出
for index, element in enumerate(storage):
print(element+"(", end="")
for i in range(len(storage_new[index])-1):
print(str(storage_new[index][i])+",", end="")
print(str(storage_new[index][-1])+");", end="")