[PyTorch] fastText

Posted: 2025.02.13 Updated: 2025.02.13

AI 스터디를 하며 ‘파이토치 트랜스포머를 활용한 자연어 처리와 컴퓨터비전 심층학습’ 교재를 정리한 글입니다.

fastText

Word2Vec과 유사하게 문장을 벡터로 변환하는 기술 기반이지만, 하위 단어를 고려하여 높은 정확도와 성능 제공
하위 단어 고려를 위해 N-gram을 사용해 단어를 분해하고 벡터화

토큰 양 끝에 ‘<’, ‘>’를 붙여 시작과 끝 표시
분해된 토큰을 하위 단어 집합으로 분해
분해된 하위 단어 집합에 토큰 자체도 포함
각 하위 단어가 고유 벡터값을 가지게 됨

fasttext = gensim.models.FastText(
    sentences=None,
    corpus_file=None,
    vector_size=100,
    alpha=0.025,
    window=5,
    min_count=5,
    workers=3,
    sg=0,
    hs=0,
    cbow_mean=1,
    negative=5,
    ns_exponent=0.75,
    max_final_vocab=None,
    epochs=5,
    batch_words=10000,
    min_n=3,  # N-gram 최솟값
    max_n=6   # N-gram 최댓값
)

from Korpora import Korpora

corpus = Korpora.load("kornli") # 한국어 자연어 추론(두 문장 간 관계 분류) 데이터셋 (함의 관계, 중립 관계, 불일치 관계)
corpus_texts = corpus.get_all_texts() + corpus.get_all_pairs() # get_all_texts: 모든 문장 튜플로 반환, get_all_pairs: 입력 문장의 대응 문장을 튜플 형태로 반환
tokens = [sentence.split() for sentence in corpus_texts]

print(tokens[:3])

from gensim.models import FastText

fastText = FastText(
    sentences=tokens,
    vector_size=128,
    window=5,
    min_count=5,
    sg=1,
    epochs=3,
    min_n=2,
    max_n=6
)

oov_token = "사랑해요"
oov_vector = fastText.wv[oov_token] # oov 토큰 처리

print(oov_token in fastText.wv.index_to_key) # index_to_key: 학습된 단어 사전을 나타내는 리스트
print(fastText.wv.most_similar(oov_vector, topn=5)) # 사랑해요 토큰을 하위 단어로 분해하여 하위 단어의 임베딩을 모두 합해 전체 토큰 임베딩으로 계산

출력

    Korpora 는 다른 분들이 연구 목적으로 공유해주신 말뭉치들을
    손쉽게 다운로드, 사용할 수 있는 기능만을 제공합니다.

    말뭉치들을 공유해 주신 분들에게 감사드리며, 각 말뭉치 별 설명과 라이센스를 공유 드립니다.
    해당 말뭉치에 대해 자세히 알고 싶으신 분은 아래의 description 을 참고,
    해당 말뭉치를 연구/상용의 목적으로 이용하실 때에는 아래의 라이센스를 참고해 주시기 바랍니다.

    # Description
    Author : KakaoBrain
    Repository : https://github.com/kakaobrain/KorNLUDatasets
    References :
        - Ham, J., Choe, Y. J., Park, K., Choi, I., & Soh, H. (2020). KorNLI and KorSTS: New Benchmark
           Datasets for Korean Natural Language Understanding. arXiv preprint arXiv:2004.03289.
           (https://arxiv.org/abs/2004.03289)

    This is the dataset repository for our paper
    "KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding."
    (https://arxiv.org/abs/2004.03289)
    We introduce KorNLI and KorSTS, which are NLI and STS datasets in Korean.

    # License
    Creative Commons Attribution-ShareAlike license (CC BY-SA 4.0)
    Details in https://creativecommons.org/licenses/by-sa/4.0/

[['개념적으로', '크림', '스키밍은', '제품과', '지리라는', '두', '가지', '기본', '차원을', '가지고', '있다.'], ['시즌', '중에', '알고', '있는', '거', '알아?', '네', '레벨에서', '다음', '레벨로', '잃어버리는', '거야', '브레이브스가', '모팀을', '떠올리기로', '결정하면', '브레이브스가', '트리플', 'A에서', '한', '남자를', '떠올리기로', '결정하면', '더블', 'A가', '그를', '대신하러', '올라가고', 'A', '한', '명이', '그를', '대신하러', '올라간다.'], ['우리', '번호', '중', '하나가', '당신의', '지시를', '세밀하게', '수행할', '것이다.']]

False
[('사랑해', 0.9022952318191528), ('사랑', 0.8698362708091736), ('사랑한', 0.8677615523338318), ('사랑해서', 0.8532077670097351), ('사랑해.', 0.8430020809173584)]

Better Jeong

[PyTorch] fastText

fastText

출력

댓글남기기