Chapter 6: Working with Sequence and Text Data

6.1 Working with Text Data

The field that deals with text data is generally called natural language processing (NLP).

As you saw when processing image data with CNNs, raw text cannot be fed to a model as-is here either.

The most basic step, converting text data into numbers, is called vectorizing.

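There are several ways to vectorize text:

Split the text into words and vectorize each word
Split the text into characters and vectorize each character
Extract n-grams (groups of consecutive words or characters) from the text and vectorize each n-gram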
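
The three ways of splitting text listed above can be sketched in a few lines. This is only a minimal illustration: the helper names `word_tokens`, `char_tokens`, and `ngrams` are hypothetical, and splitting on whitespace is an assumption (real tokenizers usually also handle punctuation and casing).

```python
def word_tokens(text):
    # Word-level tokens: split on whitespace (punctuation stays attached).
    return text.split()

def char_tokens(text):
    # Character-level tokens: every character, including spaces.
    return list(text)

def ngrams(tokens, n):
    # All runs of n consecutive tokens, in order of appearance.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

text = "The cat sat on the mat."
words = word_tokens(text)
chars = char_tokens(text)
bigrams = ngrams(words, 2)

print(words)      # 6 word tokens
print(chars[:7])  # first 7 character tokens
print(bigrams)    # 5 word-level 2-grams
```

Note that `ngrams` works equally well on the character list, which gives character-level n-grams.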

Each unit that the text is split into, as above, is called a token.

The process of producing tokens is called tokenization.

There are two main ways to vectorize tokens.

<aside> 💡 1. One-hot Encoding

2. Token Embedding (Word Embedding) </aside>
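
To make the difference between the two approaches concrete, here is a rough sketch. The tiny vocabulary, the embedding size of 2, and the random matrix are all assumptions for illustration; in a real model the embedding matrix would be learned from data rather than drawn at random.

```python
import numpy as np

# Hypothetical token -> index mapping; index 0 is left unused.
vocab = {"the": 1, "cat": 2, "sat": 3}
vocab_size = max(vocab.values()) + 1
embedding_dim = 2  # arbitrary small size chosen for this sketch

def one_hot(index, size):
    # Sparse vector: all zeros except a single 1 at the token's index.
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

# Random stand-in for an embedding matrix (one dense row per index).
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

cat_one_hot = one_hot(vocab["cat"], vocab_size)
cat_embedding = embedding_matrix[vocab["cat"]]  # embedding = row lookup

print(cat_one_hot.shape)    # length equals the vocabulary size
print(cat_embedding.shape)  # length equals embedding_dim
```

A one-hot vector grows with the vocabulary and is almost entirely zeros, while an embedding is a short dense vector. Looking up a row of the matrix gives the same result as multiplying the one-hot vector by the matrix.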

6.1.1 One-hot Encoding

# Word-level one-hot encoding

import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

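token_index = {}  # dictionary mapping each token to an integer index

for sample in samples:  # take one sentence at a time
    for word in sample.split():  # tokenize by splitting on whitespace
        if word not in token_index:
            token_index[word] = len(token_index) + 1  # indices start at 1; index 0 is unused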

max_length = 10

results = np.zeros((len(samples),
                    max_length,
                    max(token_index.values()) + 1))

for i, sample in enumerate(samples):  # enumerate yields (index, element) pairs
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        results[i, j, index] = 1.
# i indexes the sentence
# j indexes the word position within the sentence
# index is the integer assigned to that word
print(results)
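
Printing the full array is hard to read, so it can help to check its shape and contents instead. The sketch below wraps the same logic in a function (the name `one_hot_words` is hypothetical, introduced only for this example) and inspects the result.

```python
import numpy as np

def one_hot_words(samples, max_length=10):
    # Build the token -> index mapping; indices start at 1, index 0 is unused.
    token_index = {}
    for sample in samples:
        for word in sample.split():
            if word not in token_index:
                token_index[word] = len(token_index) + 1

    # Shape: (number of samples, max words per sample, number of indices).
    results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
    for i, sample in enumerate(samples):
        for j, word in enumerate(sample.split()[:max_length]):
            results[i, j, token_index[word]] = 1.0
    return results, token_index

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
results, token_index = one_hot_words(samples)

print(results.shape)             # (2, 10, 11)
print(len(token_index))          # 10
print(results.sum(axis=(1, 2)))  # [6. 5.]
```

Because the text is split only on whitespace, 'The' and 'the' count as different tokens and 'mat.' keeps its trailing period, which is why the two sentences yield 10 distinct tokens. Positions beyond the end of each sentence stay all-zero.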