Python: How to parse the body from a raw email when the raw email has no "Body" tag or anything similar

itqueen 2020. 11. 22. 21:00

It seems easy to get the From, To, and Subject headers, etc., via:

import email
b = email.message_from_string(a)
bbb = b['from']
ccc = b['to']

assuming that "a" is the raw email string, which looks something like this:

a = """From root@a1.local.tld Thu Jul 25 19:28:59 2013
Received: from a1.local.tld (localhost [127.0.0.1])
    by a1.local.tld (8.14.4/8.14.4) with ESMTP id r6Q2SxeQ003866
    for <ooo@a1.local.tld>; Thu, 25 Jul 2013 19:28:59 -0700
Received: (from root@localhost)
    by a1.local.tld (8.14.4/8.14.4/Submit) id r6Q2Sxbh003865;
    Thu, 25 Jul 2013 19:28:59 -0700
From: root@a1.local.tld
Subject: oooooooooooooooo
To: ooo@a1.local.tld
Cc: 
X-Originating-IP: 192.168.15.127
X-Mailer: Webmin 1.420
Message-Id: <1374805739.3861@a1>
Date: Thu, 25 Jul 2013 19:28:59 -0700 (PDT)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="bound1374805739"

This is a multi-part message in MIME format.

--bound1374805739
Content-Type: text/plain
Content-Transfer-Encoding: 7bit

ooooooooooooooooooooooooooooooooooooooooooooooo
ooooooooooooooooooooooooooooooooooooooooooooooo
ooooooooooooooooooooooooooooooooooooooooooooooo

--bound1374805739--"""

The question

How do I get the body of this email via Python?

So far this is the only code I am aware of, but I have not tested it yet:

# b is the message parsed above with email.message_from_string(a)
if b.is_multipart():
    for part in b.get_payload():
        print(part.get_payload())
else:
    print(b.get_payload())

Is this the correct way?

Or maybe there is something simpler, such as:

import email
b = email.message_from_string(a)
bbb = b['body']

Use Message.get_payload:

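b = email.message_from_string(a)
if b.is_multipart():
    # for a multipart message, get_payload() returns a list of sub-messages
    for payload in b.get_payload():
        # if payload.is_multipart(): ...
        print(payload.get_payload())
else:
    # for a single-part message, get_payload() returns the body string
    print(b.get_payload())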
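The commented-out `if payload.is_multipart()` branch hints that parts can themselves be multipart. One way to handle that is a small recursive generator. This is only a sketch: the helper name `iter_leaf_payloads` and the sample message are made up for illustration.

```python
import email


def iter_leaf_payloads(msg):
    """Yield the payload of every non-multipart part, descending into nested multiparts."""
    if msg.is_multipart():
        for sub in msg.get_payload():
            yield from iter_leaf_payloads(sub)
    else:
        yield msg.get_payload()


# Hypothetical sample: a multipart/alternative nested inside a multipart/mixed
raw = """\
From: sender@example.com
To: rcpt@example.com
Subject: nested
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="outer"

--outer
Content-Type: multipart/alternative; boundary="inner"

--inner
Content-Type: text/plain

plain text
--inner
Content-Type: text/html

<p>html text</p>
--inner--
--outer--
"""

leaves = [p.strip() for p in iter_leaf_payloads(email.message_from_string(raw))]
print(leaves)
```

This does not tell body parts from attachments; it only shows how to reach every leaf part.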

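To be reasonably sure you are working with the actual email body (and even then you may not be parsing the right part), you have to skip attachments and focus on the plain or HTML part (depending on your needs) for further processing.

Since those attachments can be, and very often are, text/plain or text/html parts themselves, this non-bullet-proof sample skips them by checking the Content-Disposition header: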

b = email.message_from_string(a)
body = ""

if b.is_multipart():
    for part in b.walk():
        ctype = part.get_content_type()
        cdispo = str(part.get('Content-Disposition'))

        # skip any text/plain (txt) attachments
        if ctype == 'text/plain' and 'attachment' not in cdispo:
            body = part.get_payload(decode=True)  # decode
            break
# not multipart - i.e. plain text, no attachments, keeping fingers crossed
else:
    body = b.get_payload(decode=True)

BTW, walk() iterates marvelously over the MIME parts, and get_payload(decode=True) does the dirty work of decoding base64 etc. for you.
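One detail worth knowing: in Python 3, get_payload(decode=True) returns bytes, not str, so a second step is needed to get text. The sketch below assumes the part declares its charset and falls back to UTF-8 otherwise. The helper name `part_text` and the sample message are made up for illustration.

```python
import email


def part_text(part, fallback="utf-8"):
    """Return a part's body as str: undo the transfer encoding, then decode the bytes."""
    data = part.get_payload(decode=True)  # bytes, or None for a multipart container
    if data is None:
        return ""
    # Assumption: the declared charset is one Python knows; otherwise decode() raises LookupError
    charset = part.get_content_charset() or fallback
    return data.decode(charset, errors="replace")


# Hypothetical single-part message with a base64-encoded body
raw = """\
From: sender@example.com
Subject: encoded
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64

aGVsbG8gd29ybGQ=
"""

msg = email.message_from_string(raw)
text = part_text(msg)
print(text)
```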

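Some background: as I implied, the wonderful world of MIME email presents many pitfalls that lead to finding the message body "wrongly". In the simplest case the body is in the sole "text/plain" part and get_payload() is very tempting, but we don't live in a simple world. The body is often wrapped in multipart/alternative, related, mixed, etc. content. Wikipedia describes MIME tightly, but since all of the cases below are valid and common, you have to put safety nets all around: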

Very common, and pretty much what you get when a normal editor (Gmail, Outlook) sends formatted text with an attachment:

multipart/mixed
 |
 +- multipart/related
 |   |
 |   +- multipart/alternative
 |   |   |
 |   |   +- text/plain
 |   |   +- text/html
 |   |      
 |   +- image/png
 |
 +-- application/msexcel

Relatively simple, just alternative representations:

multipart/alternative
 |
 +- text/plain
 +- text/html

For better or worse, this structure is also valid:

multipart/alternative
 |
 +- text/plain
 +- multipart/related
      |
      +- text/html
      +- image/jpeg
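For structures like these, the standard library in newer Python versions can do part of the searching. This is a different technique from the walk() loop shown earlier: with `policy=email.policy.default` the parser returns an `EmailMessage`, whose `get_body()` looks through multipart/alternative and multipart/related containers for the preferred body part. A minimal sketch, assuming Python 3.6 or later and a made-up sample message:

```python
import email
from email import policy

# Hypothetical sample with a plain and an HTML alternative
raw = """\
From: sender@example.com
To: rcpt@example.com
Subject: alternative
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="alt"

--alt
Content-Type: text/plain; charset="utf-8"

plain version
--alt
Content-Type: text/html; charset="utf-8"

<p>html version</p>
--alt--
"""

msg = email.message_from_string(raw, policy=policy.default)

# Prefer the plain-text body, fall back to HTML; get_body() returns None if neither exists
part = msg.get_body(preferencelist=("plain", "html"))

body = ""
if part is not None:
    body = part.get_content()  # str, already decoded using the part's charset
    print(part.get_content_type())
    print(body.strip())
```

Real-world mail can still be malformed, so treat the result as a best effort and not a guarantee.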

Hope this helps.

P.S. My point is: don't approach email lightly.

There is no b['body'] in Python. You have to use get_payload:

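import email

mailEntity = email.message_from_string(a)
payload = mailEntity.get_payload()
if isinstance(payload, list):
    # multipart message: get_payload() returned a list of sub-messages
    for eachPayload in payload:
        # do the things you want here; the real mail body is in eachPayload.get_payload()
        print(eachPayload.get_payload())
else:
    # there is only a single (e.g. text/plain) part, so the payload is the body itself
    print(payload)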
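To see why b['body'] does not work: subscripting a message looks up headers only, and a missing header returns None without raising an error. A small illustration with a made-up message:

```python
import email

# Hypothetical single-part message
raw = """\
From: sender@example.com
Subject: hi

just the body
"""

msg = email.message_from_string(raw)

print(msg["subject"])     # header lookup is case-insensitive
print(msg["body"])        # there is no "Body" header, so this is None
print(msg.get_payload())  # the body comes from get_payload()
```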

Good luck.

There is a very good, well-documented package for parsing email contents:

import mailparser

mail = mailparser.parse_from_file(f)
mail = mailparser.parse_from_file_obj(fp)
mail = mailparser.parse_from_string(raw_mail)
mail = mailparser.parse_from_bytes(byte_mail)

How to use it:

mail.attachments: list of all attachments
mail.body
mail.to

If emails is a pandas DataFrame and emails.message is the column holding the email text:

## Helper functions
def get_text_from_email(msg):
    '''To get the content from email objects'''
    parts = []
    for part in msg.walk():
        if part.get_content_type() == 'text/plain':
            parts.append( part.get_payload() )
    return ''.join(parts)

def split_email_addresses(line):
    '''To separate multiple email addresses'''
    if line:
        addrs = line.split(',')
        addrs = frozenset(map(lambda x: x.strip(), addrs))
    else:
        addrs = None
    return addrs 

import email
# Parse the emails into a list email objects
messages = list(map(email.message_from_string, emails['message']))
emails.drop('message', axis=1, inplace=True)
# Get fields from parsed email objects
keys = messages[0].keys()
for key in keys:
    emails[key] = [doc[key] for doc in messages]
# Parse content from emails
emails['content'] = list(map(get_text_from_email, messages))
# Split multiple email addresses
emails['From'] = emails['From'].map(split_email_addresses)
emails['To'] = emails['To'].map(split_email_addresses)

# Extract the root of 'file' as 'user'
emails['user'] = emails['file'].map(lambda x:x.split('/')[0])
del messages

emails.head()

Here is the code that works for me every time (for Outlook emails):

# Read the subject and body of emails in a folder (or subfolder)

import win32com.client

# Connect to Outlook through its MAPI namespace
outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")

# Get to the desired folder (MyEmail@xyz.com is my root folder;
# 'Inbox' and 'SubFolderName' are the subfolders)
root_folder = outlook.Folders['MyEmail@xyz.com'].Folders['Inbox'].Folders['SubFolderName']

messages = root_folder.Items

for message in messages:
    if message.Unread:  # process only unread emails
        subject_content = message.subject  # subject line of the mail
        body_content = message.body        # body of the mail

        print(subject_content)
        print(body_content)

        message.Unread = False  # mark the mail as read

Reference URL: https://stackoverflow.com/questions/17874360/python-how-to-parse-the-body-from-a-raw-email-given-that-raw-email-does-not
