Looking for data to start your new EDA (Exploratory Data Analysis) project? Or maybe just looking to automate a task that is stealing a lot of your precious time?
Extracting data from your email is a good way to collect data or to automate boring everyday tasks.
In this post I will explain how I automated a human resources task using Python, and more specifically a library called pywin32.
The package only works on Windows; trying to install it on another system raises an error. Make sure you have a virtual environment on your Windows PC and run the following command:
pip install pywin32
Start by creating an object that gives you access to your email (in this case Outlook):
import win32com.client
outlook = win32com.client.Dispatch('outlook.application').GetNamespace("MAPI")
If you have several accounts, you can use the following function to list them:
#check how many outlook accounts there are
def get_email_accounts():
    accounts = []
    for account in outlook.Accounts:
        accounts.append(account.DeliveryStore.DisplayName)
    return accounts
To check all the main folders you can access using the object, use the following function:
#iterate to see main folders
def iterate_folder(max_index = 50):
    for i in range(max_index):
        try:
            folder = outlook.GetDefaultFolder(i)
            print(i, folder)
        except Exception:
            pass  # no default folder exists for this index
If you created extra main folders, the function above cannot detect them. However, subfolders inside the inbox, or inside any other predefined folder, can be reached with the following command:
folder = outlook.GetDefaultFolder(6).Folders("<subfolder>")  # replace <subfolder> with the subfolder's name
You might be wondering why I chose 6 in the command above. The number 6 is the default index of the inbox (olFolderInbox in Outlook's object model). From there I accessed a subfolder inside the inbox, the one holding the files I wanted to extract.
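The magic number can be easier to read as a named constant. The sketch below lists a few values of Outlook's OlDefaultFolders enumeration; the dictionary `OL_DEFAULT_FOLDERS` and the helper `folder_index` are names made up for this illustration, not part of pywin32.

```python
# A few values of Outlook's OlDefaultFolders enumeration.
OL_DEFAULT_FOLDERS = {
    "deleted_items": 3,
    "outbox": 4,
    "sent_mail": 5,
    "inbox": 6,
    "calendar": 9,
    "contacts": 10,
    "drafts": 16,
}


def folder_index(name):
    """Return the default-folder index for a readable name (hypothetical helper)."""
    try:
        return OL_DEFAULT_FOLDERS[name]
    except KeyError:
        raise ValueError(f"unknown default folder: {name!r}") from None
```

With this in place, the call above could be written as outlook.GetDefaultFolder(folder_index("inbox")).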
To grab the messages inside the subfolder use the following commands:
#the last message
messages = folder.Items
message_last = messages.GetLast()
#the message before it
message_previous = messages.GetPrevious()
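The GetLast / GetPrevious pair can be wrapped in a generator so that the walk stops cleanly when the folder runs out of messages. This is only a sketch: `iter_newest_first` and `FakeItems` are hypothetical names, and it assumes that GetPrevious returns nothing (None in Python) once there is no earlier item. `FakeItems` is a stand-in so the loop can be tried without Outlook.

```python
def iter_newest_first(items, limit=None):
    """Yield messages from the last one backwards, at most `limit` of them."""
    message = items.GetLast()
    count = 0
    while message is not None and (limit is None or count < limit):
        yield message
        count += 1
        message = items.GetPrevious()


class FakeItems:
    """Stand-in for an Outlook Items collection (illustration only)."""

    def __init__(self, messages):
        self._messages = list(messages)
        self._pos = None

    def GetLast(self):
        # Move the cursor to the final item, if there is one.
        self._pos = len(self._messages) - 1
        return self._messages[self._pos] if self._messages else None

    def GetPrevious(self):
        # Step the cursor back; None signals that nothing is left.
        if self._pos is None or self._pos <= 0:
            return None
        self._pos -= 1
        return self._messages[self._pos]
```

With a real folder, the same generator would be called as iter_newest_first(folder.Items, limit=100).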
Further down I will explain how to loop over all the messages inside the subfolder. First, here is the function that looks for .pdf files inside a specific message, saves them to a directory, and returns their names in a list.
import os

def get_attachs_from_message(message, output_dir, index, max_attachments = 3):
    attachments = message.Attachments  # object that contains the attachments
    attachments_pdf = []  # names of the pdf files found
    for i in range(1, max_attachments + 1):  # Outlook collections are 1-based
        try:
            attach = attachments.Item(i)  # object that contains a single attachment
            if '.pdf' in attach.FileName:  # checks for pdf files
                attachments_pdf.append(attach.FileName)
                attach.SaveAsFile(os.path.join(output_dir, f"{index}_{attach.FileName}"))
        except Exception:
            pass  # no attachment at this position, or it could not be saved
    return attachments_pdf
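Checking for '.pdf' anywhere in the file name also matches a name such as report.pdf.zip, and it misses REPORT.PDF. A stricter check on the extension could look like the sketch below; `is_pdf` is a hypothetical helper name, not something from pywin32.

```python
import os


def is_pdf(filename):
    """True when the file name ends with a .pdf extension, ignoring case."""
    return os.path.splitext(filename)[1].lower() == ".pdf"
```

Inside the function above, the condition would then read if is_pdf(attach.FileName).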
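Finally, to loop over all the messages in the subfolder:
list_of_lists = []
index = 0  # running counter used to prefix the saved file names
# output_dir must be set beforehand to an existing directory
try:
    for i in range(0, 100):  # choose how many messages you want to parse
        print(i)
        index += 1
        message = messages.GetPrevious()  # gets previous email message
        list_of_lists.append(get_attachs_from_message(
            message,
            output_dir,
            index = f"0{str(index)}"))
except Exception:
    pass  # e.g. there are no more messages in the folder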
To wrap up, the last two code blocks extract the pdf files from the subfolder and save them into a directory. Afterwards I used PyPDF2 to extract important data from the saved pdfs and save it in a .csv file.
I hope the scripts provided are helpful for your own needs.
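Before parsing the pdfs, it can help to record which attachments came from which message. The sketch below writes such a record with Python's csv module. It leaves out the PyPDF2 step entirely, and the function name `write_manifest`, the column names, and the sample file names are all made up for illustration.

```python
import csv
import io


def write_manifest(list_of_lists, out_file):
    """Write one CSV row per saved attachment: message position and file name."""
    writer = csv.writer(out_file)
    writer.writerow(["message", "attachment"])
    for position, names in enumerate(list_of_lists, start=1):
        for name in names:
            writer.writerow([position, name])


# Sample data shaped like list_of_lists from the loop above (names are invented).
buffer = io.StringIO()
write_manifest([["cv_anna.pdf"], [], ["cv_ben.pdf", "letter.pdf"]], buffer)
manifest_text = buffer.getvalue()
```

To write to disk instead of memory, pass a file opened with open(path, "w", newline="") as `out_file`.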
Email is definitely an amazing source of data! 😎