Python: How to determine the language?

itqueen 2021. 1. 10. 19:43

I want to get this:

Input text: "ру́сский язы́к"
Output text: "Russian" 

Input text: "中文"
Output text: "Chinese" 

Input text: "にほんご"
Output text: "Japanese" 

Input text: "العَرَبِيَّة"
Output text: "Arabic"

How can I do this in Python? Thanks.

Have you had a look at langdetect?

from langdetect import detect

lang = detect("Ein, zwei, drei, vier")

print(lang)
# output: de
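
The question asks for a language name such as "Russian", while a detector like this returns a code such as `de`. A minimal sketch of that last step, assuming a small hand-written lookup table; `LANGUAGE_NAMES` and `language_name` are hypothetical names for illustration, not part of langdetect:

```python
# Hypothetical lookup table: ISO 639-1 codes to English names.
# Extend it with whichever codes your detector can return.
LANGUAGE_NAMES = {
    "ar": "Arabic",
    "de": "German",
    "en": "English",
    "ja": "Japanese",
    "ru": "Russian",
    "zh": "Chinese",
}


def language_name(code):
    """Return an English name for a language code, or the code itself if unknown."""
    # Detectors may return regional variants such as "zh-cn" or "zh_Hant",
    # so only the part before the first separator is looked up.
    base = code.replace("_", "-").split("-")[0].lower()
    return LANGUAGE_NAMES.get(base, code)


print(language_name("de"))     # German
print(language_name("zh-cn"))  # Chinese
```

Falling back to the raw code keeps the helper usable for languages missing from the table.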

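TextBlob. Requires the NLTK package and uses Google, so it needs network access. (detect_language() was deprecated in TextBlob 0.16 and later removed.)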

from textblob import TextBlob
b = TextBlob("bonjour")
b.detect_language()

pip install textblob

Polyglot. Requires numpy and some arcane libraries, ~~unlikely to work on Windows~~. (For Windows, download wheels for the appropriate versions of PyICU, Morfessor and PyCLD2, then run pip install downloaded_wheel.whl.) Able to detect text with mixed languages.

from polyglot.detect import Detector

mixed_text = u"""
China (simplified Chinese: 中国; traditional Chinese: 中國),
officially the People's Republic of China (PRC), is a sovereign state
located in East Asia.
"""
for language in Detector(mixed_text).languages:
    print(language)

# name: English     code: en       confidence:  87.0 read bytes:  1154
# name: Chinese     code: zh_Hant  confidence:   5.0 read bytes:  1755
# name: un          code: un       confidence:   0.0 read bytes:     0

pip install polyglot

To install the dependencies, run: sudo apt-get install python-numpy libicu-dev

chardet also detects the language when the input contains bytes in the range (127-255).

>>> import chardet
>>> chardet.detect("Я люблю вкусные пампушки".encode('cp1251'))
{'encoding': 'windows-1251', 'confidence': 0.9637267119204621, 'language': 'Russian'}

pip install chardet
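
The byte-range condition can be checked before calling chardet. A minimal sketch using only the standard library; `has_high_bytes` is a hypothetical helper name, and the check simply looks for any byte above 127. Pure-ASCII input has no such bytes, so the language guess would have nothing to work with:

```python
def has_high_bytes(data):
    """Return True if the byte string contains any byte outside the ASCII range."""
    # Iterating over a bytes object yields integers in Python 3.
    return any(byte > 127 for byte in data)


russian = "Я люблю вкусные пампушки".encode("cp1251")
english = "I love tasty dumplings".encode("cp1251")

print(has_high_bytes(russian))  # True
print(has_high_bytes(english))  # False
```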

langdetect requires large portions of text. It uses a non-deterministic approach under the hood, which means you can get different results for the same text sample. The docs say to use the following code to make it deterministic:
```
from langdetect import detect, DetectorFactory
DetectorFactory.seed = 0
detect('今一はお前さん')
```

pip install langdetect

guess_language can detect very short samples by using a spell checker with dictionaries.

pip install guess_language-spirit

langid provides both a module

import langid
langid.classify("This is a test")
# ('en', -54.41310358047485)

and a command-line tool:

    $ langid < README.md

pip install langid

FastText is a text classifier that can recognize 176 languages when used with a proper model for language classification. Download the model, then:

import fasttext
model = fasttext.load_model('lid.176.ftz')
print(model.predict('الشمس تشرق', k=2))  # top 2 matching languages

(('__label__ar', '__label__fa'), array([0.98124713, 0.01265871]))

pip install fasttext
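
The prediction comes back as a pair of labels and probabilities, with each label carrying a `__label__` prefix. A small sketch for turning that into plain language codes; `parse_prediction` is a hypothetical helper, and the sample tuple copies the output shown above, with a plain list standing in for the NumPy array so the snippet runs without fastText installed:

```python
def parse_prediction(prediction, prefix="__label__"):
    """Turn (labels, probabilities) into a list of (language_code, probability) pairs."""
    labels, probabilities = prediction
    return [
        (label[len(prefix):] if label.startswith(prefix) else label, float(prob))
        for label, prob in zip(labels, probabilities)
    ]


# Same shape as the model output, with a list instead of a NumPy array.
sample = (("__label__ar", "__label__fa"), [0.98124713, 0.01265871])
print(parse_prediction(sample))
# [('ar', 0.98124713), ('fa', 0.01265871)]
```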

There is an issue with langdetect: it fails when used in parallel. spacy_langdetect is a wrapper around it that you can use for that purpose. You can use the following snippet as well:

import spacy
from spacy_langdetect import LanguageDetector

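# spaCy 2.x API; spaCy 3 expects a full model name such as "en_core_web_sm"
# and a registered component name (a string) in add_pipe.
nlp = spacy.load("en")
nlp.add_pipe(LanguageDetector(), name="language_detector", last=True)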
text = "This is English text Er lebt mit seinen Eltern und seiner Schwester in Berlin. Yo me divierto todos los días en el parque. Je m'appelle Angélica Summer, j'ai 12 ans et je suis canadienne."
doc = nlp(text)
# document level language detection. Think of it like average language of document!
print(doc._.language['language'])
# sentence level language detection
for i, sent in enumerate(doc.sents):
    print(sent, sent._.language)

You can try determining the Unicode block of the characters in the input string to narrow down the language (Cyrillic for Russian, for example), and then search the text for language-specific symbols.
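
A rough sketch of the first step, using only the standard library. It takes the first word of each letter's Unicode character name as the script; the `SCRIPT_TO_LANGUAGE` table and `guess_by_script` function are hypothetical names, and the mapping is an assumption. A script is only a hint, since Cyrillic is also used for Ukrainian and Bulgarian, and CJK ideographs also appear in Japanese text:

```python
import unicodedata
from collections import Counter

# Hypothetical mapping from the first word of a Unicode character name
# to a language guess. Extend or adjust it for your own data.
SCRIPT_TO_LANGUAGE = {
    "CYRILLIC": "Russian",
    "CJK": "Chinese",
    "HIRAGANA": "Japanese",
    "KATAKANA": "Japanese",
    "ARABIC": "Arabic",
}


def guess_by_script(text):
    """Guess a language from the most common script among the letters of the text."""
    scripts = Counter()
    for char in text:
        if not char.isalpha():
            continue  # skip spaces, digits, punctuation and combining marks
        name = unicodedata.name(char, "")
        if name:
            scripts[name.split()[0]] += 1
    if not scripts:
        return "Unknown"
    script = scripts.most_common(1)[0][0]
    return SCRIPT_TO_LANGUAGE.get(script, "Unknown")


for sample in ["ру́сский язы́к", "中文", "にほんご", "العَرَبِيَّة"]:
    print(guess_by_script(sample))
# Russian
# Chinese
# Japanese
# Arabic
```

The second step, searching for language-specific symbols, is not covered here; it would be needed to tell apart languages that share a script.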

Reference URL: https://stackoverflow.com/questions/39142778/python-how-to-determine-the-language
