8. 1. Text generation with LSTM

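The general way to generate sequence data: use the previous tokens as input and predict the next token (or next few tokens) in the sequence. (e.g., given the text "김진", predict "우")

A network that predicts the next token from the previous ones is called a language model.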

Feed in an initial text → generate the next character → append it and repeat
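The loop above can be sketched without a neural network. The example below is only an illustration: it swaps the LSTM for a bigram count table, and the names `build_bigram_model`, `generate`, and `toy_model` are made up for this sketch.

```python
from collections import Counter, defaultdict

def build_bigram_model(corpus):
    """Count, for each character, which characters follow it in the corpus."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(model, seed, n_chars):
    """Repeatedly predict the next character and append it to the text."""
    generated = seed
    for _ in range(n_chars):
        followers = model[generated[-1]]   # condition on the last character only
        if not followers:                  # no known continuation: stop early
            break
        next_char = followers.most_common(1)[0][0]  # most frequent follower
        generated += next_char
    return generated

# Hypothetical toy corpus, standing in for the real training text
toy_model = build_bigram_model("abcabcabd")
print(generate(toy_model, "a", 5))
```

A real language model conditions on a much longer context than one character, but the inject, predict, append, repeat structure is the same.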

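When generating text, there are two ways to choose the next character:

1. Greedy sampling (similar to a greedy algorithm): always pick the character with the highest probability
2. Stochastic sampling: pick the next character at random, according to the predicted probabilities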
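A minimal sketch of the two strategies, assuming the model has already produced a probability vector over the next character. The distribution `probs` and the helper names `greedy_sample` and `stochastic_sample` are hypothetical, chosen only for this example.

```python
import numpy as np

def greedy_sample(probs):
    """Greedy sampling: always return the index with the highest probability."""
    return int(np.argmax(probs))

def stochastic_sample(probs, rng):
    """Stochastic sampling: draw an index at random, weighted by probs."""
    return int(rng.choice(len(probs), p=probs))

probs = np.array([0.1, 0.6, 0.3])   # hypothetical next-character distribution
rng = np.random.default_rng(0)       # fixed seed so the run is repeatable

greedy_picks = [greedy_sample(probs) for _ in range(1000)]
random_picks = [stochastic_sample(probs, rng) for _ in range(1000)]

print(set(greedy_picks))                                # greedy only ever picks index 1
print(np.bincount(random_picks, minlength=3) / 1000)    # frequencies roughly follow probs
```

Greedy sampling tends to produce repetitive text because the same choice is made every time, while stochastic sampling lets less likely characters appear occasionally.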

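To control the amount of randomness in the sampling process, a softmax temperature is set.

When the temperature is lowered, more predictable text is generated

When the temperature is raised, more surprising and unusual text is generated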
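The effect of temperature can be checked on a small distribution. This is a sketch under the assumption that the temperature is applied by dividing the log-probabilities and renormalizing; `reweight_distribution` and the example `probs` are names and values introduced here for illustration.

```python
import numpy as np

def reweight_distribution(probs, temperature):
    """Sharpen (temperature < 1) or flatten (temperature > 1) a distribution."""
    probs = np.asarray(probs, dtype="float64")
    logits = np.log(probs) / temperature     # scale the log-probabilities
    exp_logits = np.exp(logits)
    return exp_logits / np.sum(exp_logits)   # renormalize so the values sum to 1

probs = np.array([0.5, 0.3, 0.2])   # hypothetical model output
for temperature in [0.2, 1.0, 2.0]:
    print(temperature, np.round(reweight_distribution(probs, temperature), 3))
```

At temperature 1.0 the distribution is unchanged. Below 1.0 the most likely entry takes almost all of the probability mass, and above 1.0 the entries move closer to uniform.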

import keras
import numpy as np
import tensorflow

path = tensorflow.keras.utils.get_file(
    'nietzsche.txt',
    origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
text = open(path).read().lower()
print('Corpus length:', len(text))


# Extract sequences of 60 characters.
maxlen = 60

# Sample a new sequence every three characters.
step = 3

# List holding the extracted sequences
sentences = []

# List holding the targets (the character that follows each sequence)
next_chars = []

for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('Number of sequences:', len(sentences))

# Sorted list of the unique characters in the corpus
chars = sorted(list(set(text)))
print('Unique characters:', len(chars))
# Dictionary mapping each character in chars to its index
char_indices = {char: i for i, char in enumerate(chars)}

# One-hot encode the characters into binary arrays of 0s and 1s.
print('Vectorization...')
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1      # one-hot encode the target character
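As a small illustration of the windowing and one-hot encoding above, here is the same procedure on a toy string, where the resulting arrays are small enough to inspect. All `toy_` names and values are made up for this sketch and are not part of the original code.

```python
import numpy as np

# Toy stand-ins for text, maxlen, and step (hypothetical values)
toy_text = "hello world"
toy_maxlen, toy_step = 4, 3

# Slide a window of toy_maxlen characters, moving toy_step characters each time
toy_starts = range(0, len(toy_text) - toy_maxlen, toy_step)
toy_windows = [toy_text[i: i + toy_maxlen] for i in toy_starts]
toy_targets = [toy_text[i + toy_maxlen] for i in toy_starts]

# Map each unique character to an index
toy_chars = sorted(set(toy_text))
toy_indices = {char: i for i, char in enumerate(toy_chars)}

# One-hot encode the windows and their target characters
toy_x = np.zeros((len(toy_windows), toy_maxlen, len(toy_chars)), dtype=bool)
toy_y = np.zeros((len(toy_windows), len(toy_chars)), dtype=bool)
for i, window in enumerate(toy_windows):
    for t, char in enumerate(window):
        toy_x[i, t, toy_indices[char]] = 1
    toy_y[i, toy_indices[toy_targets[i]]] = 1

print(toy_windows, toy_targets)
print(toy_x.shape, toy_y.shape)   # (windows, window length, alphabet size), (windows, alphabet size)
```

Each window contributes exactly one `True` per time step in `toy_x` and one `True` in `toy_y`, which is the shape the LSTM and the softmax output layer expect.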

from keras import layers

model = keras.models.Sequential()
model.add(layers.LSTM(128, input_shape=(maxlen, len(chars))))
model.add(layers.Dense(len(chars), activation='softmax'))

# The targets are one-hot encoded, so categorical_crossentropy is used as the loss.
optimizer = tensorflow.keras.optimizers.RMSprop(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

def sample(preds, temperature=1.0):  # default temperature is 1.0
    # Convert preds to float64, take the log, and divide by the temperature
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)

    # Renormalize so the values sum to 1, then draw once from the
    # resulting multinomial distribution and return the sampled index
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

import random
import sys

random.seed(2)
start_index = random.randint(0, len(text) - maxlen - 1)
# Seed text: ntranslatable for him. everything

# Train the model for 60 epochs
for epoch in range(1, 61):
    print('epoch', epoch)
    # Fit the model for one pass over the data
    model.fit(x, y, batch_size=128, epochs=1)

    # Take the seed text at start_index (picked once above, so it is the same every epoch)
    seed_text = text[start_index: start_index + maxlen]
    print('--- Seed text: "' + seed_text + '"')

    # Try a range of sampling temperatures
    for temperature in [0.2, 0.5, 1.0, 1.2]:
        print('------ temperature:', temperature)
        generated_text = seed_text
        sys.stdout.write(generated_text)

        # Starting from the seed text, generate 400 characters,
        # each one drawn by stochastic sampling
        for i in range(400):
            # One-hot encode the current window of generated characters
            sampled = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(generated_text):
                sampled[0, t, char_indices[char]] = 1.

            # Sample the next character
            preds = model.predict(sampled, verbose=0)[0]
            next_index = sample(preds, temperature)
            next_char = chars[next_index]

            generated_text += next_char
            generated_text = generated_text[1:]

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

------ temperature: 0.2
ntranslatable for him. everything
ponderous, viscous, and power of the sense and the subject and the moral precisely in the sense of the superficialism of the sense of the sense of the superstition of the most distinguage world of the most destructions of the contrary the actions of the sense of the sense of the superstition of the most distinguage for the superficialism and the sense of the sense of the conscience, as a not the subjective the most such a
------ temperature: 0.5
ntranslatable for him. everything
ponderous, viscous, and possesswive of the german society which they seems to contrary to great and reverence of a man who has in the general is most distinguist of the traditional existence, as a noble of the world, which we must make us a formine, and the ages of the last present of the most successful to a man is no things the senses and superficles, and the senses of the same causa of a bourse and supposes which has be
------ temperature: 1.0
ntranslatable for him. everything
ponderous, viscous, and possessh. is into the finully the , as society what who has its elevation, to much it
use-dargifing the elements in
the toriment which a discovered even spirits to usgable our systems and no rance of true oxgreadneh happne forward."

8

=socies,
fairhing:--the
world in the
exermines wait, the young, how different and most the existence--with selfished and the europeal intentions them to see am which
------ temperature: 1.2
ntranslatable for him. everything
ponderous, viscous, and postime as as the
lofty reality,
postibibrest lead
the accordings is
dongably be
latts, convenies" valuations owing tlose imaged, weneriture darkes and celtauced. with generally preises
are oppressions wors c'vest exertionan is, to us one way moate of others. withlepurting view
geded", not from roul;
why
playing, under rotal,
but vanity that iuld gots.

78. the being did in a dayly soleminable
germa

Softmax temperature

8. 2. DeepDream

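DeepDream is an artistic image-modification technique that uses the representations learned by convolutional neural networks (personally, I find the results creepy)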

Google released it in 2015, implemented with the Caffe library