Zodiac Wang
  • Home
  • Categories
  • Tags
  • Archives

一道编程题

女朋友找工作时遇到的一道编程题

文本处理是非常常见但又非常重要的任务。其中操作纷繁复杂。而今天我们的目标就是制作一个小型的文本预处理器。其主要功能就是对文本进行预处理以便后续进行固定宽度的排版。为了方便说明,我们定义如下的概念:

  • ● 空白字符(white space):指空格 ' '。
  • ● 文本字符(character):指大写或者小写的英文字母。
  • ● 节(segment):指一串(大于或者等于一个)连续的空白字符或者文本字符。

我们的文本处理器的功能还很朴素,无法处理除 空白字符 和 文本字符 之外的字符。我们将会使用一个固定长度作为宽度进行排版。大于这个宽度的字符会被折行。而折行不会显示任何连字符(例如 “-”),也无需对 空白字符 进行额外处理。例如,假设我们规定最大宽度为30,则 "The main theme of education in engineering school is learning toteach yourself" 将排版为:

The main theme of education in
 engineering school is learnin
g to teach yourself

现在请书写一个函数,该函数的输入为两个参数:

  • ● 需要处理的文本
  • ● 排版宽度。

该函数的返回值为预处理后的文本。预处理后的文本为每一节,及其所在的行号。中间以分号隔开。若一个节跨越了多行,则行号用逗号隔开,并从小到大进行排列。例如,假设输入为:The main theme of education in engineering school is learning to teach yourself,并且排版宽度指定为 30,则返回:

The(1); (1);main(1); (1);theme(1); (1);of(1); (1);education(1); (1);in(1); (2);engineering(2); (2);school(2); (2);is(2); (2);learning(2,3); (3);to(3); (3);teach(3); (3);yourself(3);

又例如,假设输入为:"So many whitespaces",而排版宽度为 10,则返回:

So(1); (1);many(1); (1);whitespaces(2,3);

Python demo

首先将句子拆分为,单次和空格的数组

In [1]:
strr = "The main theme of education in engineering    \
school is learning to teach yourself"

storage = []
buff_str = ""

# 字符串分割为数组
for i in range(len(strr)):
    if not i:
        flag = (strr[i] == " ")
        buff_str += strr[i]
    else:
        if (strr[i] == " ") == flag:
            buff_str += strr[i]
        else:
            storage.append(buff_str)
            buff_str = strr[i]
            flag = (strr[i] == " ")
storage.append(buff_str)
In [2]:
storage
Out[2]:
['The',
 ' ',
 'main',
 ' ',
 'theme',
 ' ',
 'of',
 ' ',
 'education',
 ' ',
 'in',
 ' ',
 'engineering',
 '    ',
 'school',
 ' ',
 'is',
 ' ',
 'learning',
 ' ',
 'to',
 ' ',
 'teach',
 ' ',
 'yourself']

对buff区计数,以记录每个节的行号

In [3]:
# 判断行号
storage_new = [0]*len(storage)

for i in range(len(storage)):
    storage_new[i] = []

buff_count = 0
line_length = 30

cnt = 1
for index, element in enumerate(storage):
    buff_count += len(element)
    if buff_count < line_length:
        storage_new[index].append(cnt)
    elif buff_count == line_length:
        buff_count = 0
        storage_new[index].append(cnt)
        cnt += 1
    else:
        x, y = divmod(buff_count, line_length)
        for i in range(x+1):
            storage_new[index].append(cnt+i)
        buff_count = y
        cnt += x
In [4]:
storage_new
Out[4]:
[[1],
 [1],
 [1],
 [1],
 [1],
 [1],
 [1],
 [1],
 [1],
 [1],
 [1],
 [2],
 [2],
 [2],
 [2],
 [2],
 [2],
 [2],
 [2, 3],
 [3],
 [3],
 [3],
 [3],
 [3],
 [3]]

最后输出

In [5]:
# 输出
for index, element in enumerate(storage):
    print(element+"(", end="")
    for i in range(len(storage_new[index])-1):
        print(str(storage_new[index][i])+",", end="")
    print(str(storage_new[index][-1])+");", end="")
The(1); (1);main(1); (1);theme(1); (1);of(1); (1);education(1); (1);in(1); (2);engineering(2);    (2);school(2); (2);is(2); (2);learning(2,3); (3);to(3); (3);teach(3); (3);yourself(3);

  • « collections模块小结
  • Ubuntu下搭建深度学习环境(CUDA-cuDNN-TensorFlow) »

Published

10 3, 2018

Category

posts

Tags

  • Algorithm 1
  • 文本处理 1

Contact

  • Zodiac Wang - A Fantastic Learner
  • Powered by Pelican. Theme: Elegant