Python

A Complete Guide to Time Series Data Visualization in Python

posted Nov 24, 2020, 7:35 AM by Chris G   [ updated Nov 24, 2020, 7:41 AM ]

From: https://towardsdatascience.com/a-complete-guide-to-time-series-data-visualization-in-python-da0ddd2cfb01

This Should Give You Enough Resources to Make Great Visuals with Time Series Data

Time series data matters in many industries, including research, finance, pharmaceuticals, social media, and web services, and analyzing it is becoming more and more essential. Good visualizations are one of the best tools for that analysis: one good plot can give you a better understanding than a 20-page report. So, this article is all about time-series data visualization.

I will start with some very simple visualizations and slowly move on to more advanced visualization techniques and tools.

I need to make one more thing clear before starting.

‘The complete guide’ in the title does not mean that it covers every visualization. There are so many visualizations available in so many different libraries that it is not practical to fit all of them into one article.

But this article should provide you with enough tools and techniques to tell a story with time series data, or to understand and visualize it clearly. I tried to explain some simple and easy techniques as well as some advanced ones.

Dataset

If you are reading this for learning, the best way is to follow along and run all the code by yourself. Please feel free to download the dataset from this link:

rashida048/Datasets

github.com

This is a stock dataset. Let’s import some necessary packages and the dataset:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv("stock_data.csv", parse_dates=True, index_col = "Date")
df.head()

I used the ‘parse_dates’ parameter of the read_csv function together with ‘index_col’ so that the ‘Date’ column is parsed and becomes a DatetimeIndex. Dates are usually stored as strings, which is not the right format for time series data analysis. Once the index is a DatetimeIndex, the data is much easier to work with as a time series. You will see it soon.
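To make the benefit of a DatetimeIndex concrete, here is a minimal sketch on a made-up three-row CSV. The values are invented for illustration and are not taken from stock_data.csv; only the ‘Date’ and ‘Close’ column names follow the dataset.

```python
import io

import pandas as pd

# Hypothetical miniature CSV standing in for stock_data.csv.
csv_text = """Date,Close
2016-12-30,10.0
2017-01-03,11.0
2017-01-04,12.0
"""

# parse_dates=True parses the index column, so the index becomes a DatetimeIndex.
toy = pd.read_csv(io.StringIO(csv_text), parse_dates=True, index_col="Date")
print(type(toy.index).__name__)

# With a DatetimeIndex, a year string selects every row of that year.
rows_2017 = toy.loc["2017"]
print(rows_2017)
```

With a plain string index, the same ‘2017’ lookup would only match a label that is literally "2017", so selecting by year would not work this way.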

I have a detailed article on Time-series data analysis. If you are new to time series data, it will be helpful if you have a look at this article first:

An Ultimate Guide to Time Series Analysis in Pandas

All the Pandas Function You Need to Perform Time Series Analysis in Pandas. You Can Use This as a Cheat Sheet as Well.

towardsdatascience.com

In the article above, I explained some important Pandas functions that will be used in this article. I will give a brief idea of them here as well, but if you need an example to understand them better, please feel free to have a look at that previous article.

Basic Plots First

As I said before, I want to start with some basic plots. The most basic plot should be a line plot using Pandas. I will plot the ‘Volume’ data here. See how it looks:

df['Volume'].plot()

This is our plot of ‘Volume’ data that looks pretty busy with some big spikes. It will be a good idea to plot all the other columns as well in a plot to examine the curves of all of them at the same time.

df.plot(subplots=True, figsize=(10,12))

The curves for ‘Open’, ‘Close’, ‘High’ and ‘Low’ all have the same shape. Only ‘Volume’ has a different shape.

Seasonality

The line plot I used above is great for showing seasonality. Resampling by month or week and making bar plots is another very simple and widely used method of finding seasonality. Here I am making a bar plot of the monthly data for 2016 and 2017. For the index, I will use the slice ['2016':], because our dataset contains data until 2017. So, 2016 to the end should bring 2016 and 2017.

import matplotlib.dates as mdates

# Monthly averages ("M" is the month-end alias; pandas 2.2+ prefers "ME")
df_month = df.resample("M").mean()
fig, ax = plt.subplots(figsize=(10, 6))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
ax.bar(df_month.loc['2016':].index, df_month.loc['2016':, "Volume"], width=25, align='center')

There are 24 bars, and each bar represents a month. There is a huge spike in July 2017. Otherwise, there is no monthly seasonality here.

Another way to find seasonality is by using a set of boxplots. Here I am going to make boxplots for each month. I will use ‘Open’, ‘Close’, ‘High’ and ‘Low’ data to make this plot.

import seaborn as sns

# Assumption: the 'Month' column used on the x-axis is derived from the index.
df['Month'] = df.index.month

fig, axes = plt.subplots(4, 1, figsize=(10, 16), sharex=True)
for name, ax in zip(['Open', 'Close', 'High', 'Low'], axes):
    sns.boxplot(data=df, x='Month', y=name, ax=ax)
    ax.set_ylabel("")
    ax.set_title(name)
    # Keep the x-axis label only on the bottom plot.
    if ax is not axes[-1]:
        ax.set_xlabel('')

It shows the monthly difference in values clearly.

There are more ways to show seasonality. I discuss one more at the end of this article.

Resampling and Rolling

Remember that first line plot of ‘Volume’ data above? As we discussed before, it was too busy. It can be fixed by resampling. Instead of plotting daily data, plotting the monthly average will fix this issue to a large extent. I will use the df_month dataset I already prepared for the bar plot above.

df_month['Volume'].plot(figsize=(8, 6))

Much more understandable and clearer! It gives a better idea of the long-term trend.

Resampling is very common in time-series data. Most of the time, resampling is done to a lower frequency.

So, this article will only deal with resampling to lower frequencies. Resampling to a higher frequency is also sometimes necessary, especially for modeling purposes, but not so much for data analysis.

In the ‘Volume’ data we are working on right now, we can observe some big spikes here and there. These types of spikes are not helpful for data analysis or for modeling. Normally, to smooth out the spikes, resampling to a lower frequency and rolling are very helpful.

Now, plot the daily data and weekly average ‘Volume’ in the same plot. First, make a weekly average dataset using the resampling method.

df_week = df.resample("W").mean()

This ‘df_week’ and ‘df_month’ will be useful for us in later visualization as well.

Let’s plot the daily and weekly data in the same plot.

start, end = '2015-01', '2015-08'
fig, ax = plt.subplots()
ax.plot(df.loc[start:end, 'Volume'], marker='.', linestyle='-', linewidth=0.5, label='Daily', color='black')
ax.plot(df_week.loc[start:end, 'Volume'], marker='o', markersize=8, linestyle='-', label='Weekly', color='coral')
ax.set_ylabel("Volume")
ax.legend()

Look, the weekly average plot has smaller spikes than daily data.

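Rolling is another very helpful way of smoothing out the curve. It takes the average over a window of a specified number of consecutive data points, and the window moves forward one point at a time. If I want a 7-day rolling average, each point becomes the average of a 7-day window.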

Let’s include the 7-d rolling data in the above plot.

df_7d_rolling = df.rolling(7, center=True).mean()
start, end = '2016-06', '2017-05'
fig, ax = plt.subplots()
ax.plot(df.loc[start:end, 'Volume'], marker='.', linestyle='-',
        linewidth=0.5, label='Daily')
ax.plot(df_week.loc[start:end, 'Volume'], marker='o', markersize=5,
        linestyle='-', label='Weekly mean volume')
ax.plot(df_7d_rolling.loc[start:end, 'Volume'], marker='.', linestyle='-', label='7d Rolling Average')
ax.set_ylabel('Stock Volume')
ax.legend()

There is a lot going on in this one plot. But if you look at it carefully, it is still understandable. Notice that the 7-d rolling average is a bit smoother than the weekly average.

It is also common to take a 30-d or 365-d rolling average to make the curve smoother. Please try it yourself.
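The difference between resampling and rolling is easier to see on a tiny example. The sketch below uses a made-up 14-day series (flat at 10 with one spike of 100), not the stock data: resampling returns one value per week, while rolling returns one smoothed value per day.

```python
import numpy as np
import pandas as pd

# Made-up daily series: 14 days, flat at 10 with a single spike of 100.
idx = pd.date_range("2021-01-04", periods=14, freq="D")  # starts on a Monday
values = np.full(14, 10.0)
values[3] = 100.0
daily = pd.Series(values, index=idx)

# Weekly resampling: one value per calendar week (weeks end on Sunday by default).
weekly = daily.resample("W").mean()

# 7-day rolling mean: one value per day, averaged over a centered 7-row window.
# The first and last three days have no full window, so they are NaN.
rolling = daily.rolling(7, center=True).mean()

print(len(daily), len(weekly), len(rolling))
print(daily.max(), weekly.max(), rolling.max())
```

Both methods flatten the spike considerably, but rolling keeps the daily resolution, which is why its curve tends to look smoother when plotted next to the weekly means.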

Plotting the Change

A lot of the time, it is more useful to see how the data changes over time than to look at the everyday data itself.

There are a few different ways to calculate and visualize the change in data.

Shift

The shift function shifts the data forward or backward by a specified number of periods. If I do not specify the number, it shifts the data by one period by default, which for this daily data means one row. That means each row gets the previous day's data. In financial data like this, it is helpful to see the previous day's data and today's data side by side.

As this article is dedicated to visualization only, I will only plot the change, that is, the ratio of each day's closing price to the previous day's:

df['Change'] = df.Close.div(df.Close.shift())
df['Change'].plot(figsize=(20, 8), fontsize = 16)

In the code above, .div() means division. df.div(6) divides each element in df by 6. Here I passed ‘df.Close.shift()’ instead, so each element of ‘df.Close’ is divided by the corresponding element of ‘df.Close.shift()’, that is, by the previous day's closing price. The very first row has no previous day, so its value is NaN (missing), and the plot simply skips it.
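Here is a minimal sketch of what shift() and div() do, on three made-up closing prices rather than the real dataset:

```python
import pandas as pd

# Three invented closing prices on consecutive days.
close = pd.Series([100.0, 110.0, 99.0],
                  index=pd.date_range("2021-01-04", periods=3, freq="D"))

previous = close.shift()       # each row now holds the previous row's value
change = close.div(previous)   # element-wise division, same as close / previous

print(pd.DataFrame({"close": close, "previous": previous, "change": change}))
```

The first ‘change’ value is NaN because there is nothing before the first row; the other two are 1.1 (a 10% rise) and 0.9 (a 10% drop).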

If it is not yet clear to you, please look at the article I mentioned in the beginning.


You can simply take a specific period and plot to have a clearer look. This is the plot of 2017 only.

df.loc['2017', 'Change'].plot(figsize=(10, 6))

The shift is useful in many ways, but I find percent change useful on many occasions too.

Percent_Change

I will use the monthly data that was calculated in the beginning. This time I chose bar plots, because they show the percent change clearly. Pandas has a pct_change() function that returns the percent change from the previous row.

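import matplotlib.dates as mdates

# Percent change between consecutive monthly means of 'Close'
df_month['pct_change'] = df_month['Close'].pct_change() * 100

fig, ax = plt.subplots(figsize=(12, 6))
# Plot against the dates themselves so that the date formatter below applies
ax.bar(df_month.index, df_month['pct_change'], width=20, color='coral', label='pct_change')
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %Y'))
plt.xticks(rotation=45)
ax.legend()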

This plot shows the month-over-month percent change in the closing data.

Differencing

Differencing takes the difference between values that are a specified distance apart. By default, the distance is one. If you specify 2, as in ‘df.High.diff(2)’, it subtracts the first element of the ‘High’ column from the third, the second from the fourth, and so on.

It is a popular method for removing the trend from the data, because many forecasting and modeling methods work best on data without a trend.
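A small sketch on invented numbers shows how the distance argument changes the result:

```python
import pandas as pd

# Invented values with a clear upward trend.
high = pd.Series([1.0, 4.0, 9.0, 16.0, 25.0])

diff_1 = high.diff()     # each element minus the previous one
diff_2 = high.diff(2)    # each element minus the one two positions back

print(diff_1.tolist())
print(diff_2.tolist())
```

The first element of diff() and the first two elements of diff(2) are NaN, because there is nothing far enough back to subtract.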

df.High.diff().plot(figsize=(10, 6))

Expanding Window

This is another way of transforming the data. An expanding window keeps growing: each value is computed from all the data up to that point. For example, if you apply an expanding sum to the ‘High’ column, the first element stays the same, the second becomes the sum of the first and second elements, the third becomes the sum of the first, second, and third elements, and so on. You can use other aggregate functions such as mean, median, or standard deviation on it too.

In that way, it will provide you with the changing mean, median, sum, or standard deviation with time. Isn’t it really useful for financial data or business sales or profit data?
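As a quick illustration on four made-up values, an expanding sum is the same as a cumulative sum, and an expanding mean is the average of everything seen so far:

```python
import pandas as pd

high = pd.Series([2.0, 4.0, 6.0, 8.0])  # invented values

expanding_sum = high.expanding().sum()    # sum of all elements up to each row
expanding_mean = high.expanding().mean()  # mean of all elements up to each row

print(expanding_sum.tolist())
print(expanding_mean.tolist())
```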

fig, ax = plt.subplots()
df.High.plot(ax=ax, label='High')
df.High.expanding().mean().plot(ax=ax, label='High expanding mean')
df.High.expanding().std().plot(ax=ax, label='High expanding std')
ax.legend()

Here I added expanding mean and standard deviation. Look at the daily data and the mean. At the end of 2017, daily data shows a huge spike. But it doesn’t show a spike in the average. Probably if you take the 2017 data only, the expanding average will look different. Please feel free to try it.

Heat Map

A heat map is a common type of data visualization that is used everywhere. Heat maps can be very useful for time-series data as well.

But before diving into the heat map, we need to develop a calendar that will represent each year and month data of our dataset. Let’s see it in an example.

For this demonstration, I will import a calendar package and use the pivot table function to generate the values.

import calendar

# Assumption: the 'month' and 'year' columns are derived from the DatetimeIndex.
df['month'] = df.index.month
df['year'] = df.index.year

all_month_year_df = pd.pivot_table(df, values="Open",
                                   index=["month"],
                                   columns=["year"],
                                   fill_value=0,
                                   margins=True)
# Replace month numbers with abbreviated names; keep the 'All' margin row as it is
all_month_year_df.index = [calendar.month_abbr[i] if isinstance(i, (int, np.integer)) else i
                           for i in all_month_year_df.index]
all_month_year_df

The calendar is ready with monthly average ‘Open’ data. Now, generate the heat map with it.

ax = sns.heatmap(all_month_year_df, cmap='RdYlGn_r', robust=True, fmt='.2f', 
annot=True, linewidths=.5, annot_kws={'size':11},
cbar_kws={'shrink':.8, 'label':'Open'})

ax.set_yticklabels(ax.get_yticklabels(), rotation=0, fontsize=10)
ax.set_xticklabels(ax.get_xticklabels(), rotation=0, fontsize=10)
plt.title('Average Opening', fontdict={'fontsize':18}, pad=14);

The heat map is ready! Darker red means a very high opening and dark green means a very low opening.

Decomposition

Decomposition shows the observations and these three components in the same plot:

Trend: The consistent upward or downward slope of a time series.

Seasonality: The clear periodic pattern of a time series.

Noise: The irregular variation that remains after the trend and seasonality are removed (the residuals). This is also where outliers show up.

Using the statsmodels library, it is easy to do:

from matplotlib import rcParams
import statsmodels.api as sm
rcParams['figure.figsize'] = 11, 9
decomposition = sm.tsa.seasonal_decompose(df_month['Volume'], model='additive')
fig = decomposition.plot()
plt.show()

Here the trend is the moving average. To give you a high-level idea of residuals in the last row, here is the general formula:

Original observations = Trend + Seasonality + Residuals

The documentation for this decomposition itself says that it is a very naive representation, but it is still popular.

Conclusion

If you could run all the code above, congratulations! You learned enough today to make great time series data visualizations. As I mentioned in the beginning, there are a lot of cool visualization techniques available. I will write more in the future.

The Next Level of Functional Programming in Python

posted Nov 24, 2020, 7:31 AM by Chris G   [ updated Nov 24, 2020, 7:32 AM ]

From: https://towardsdatascience.com/the-next-level-of-functional-programming-in-python-bc534b9bdce1


Useful Python tips and tricks to take your functional skills to the next level


Preface: Are you ready to take your Python skills to the next level? Can you walk through the examples in this story without squinting? Functional programming in Python might be daunting for some, but it is still very fulfilling.

Python is one of the world’s most popular and in-demand programming languages. Indeed, if you are in the hot field of Data Science, Python is, most probably, your daily driver. But why?

  • Python is easy to learn for beginners, even if they don’t have a Computer Science degree
  • Python has a mature and supportive community and a huge range of modules and libraries
  • Python is a versatile, extremely hackable, flexible, efficient and reliable programming language
  • Python can help you automate menial tasks, such as creating, moving and renaming files or folders

However, this story is not for beginners. Some experience with the language is required to understand the examples and their value. Thus, without further ado let’s dive into the itertools module of Python, to discover a few hidden-in-plain-sight treasures of the language.

Definitions

Every function that we examine here is part of Python’s itertools library. That means that to import them you just write from itertools import x, where x is the function at hand. You will see that most of them do not do anything especially useful by themselves, but the magic starts when you compose them.

Note that the examples work with Python 3.x.

  • count: count takes a start and a step and generates numbers from start onwards, without end. For example:
count(start=10, step=1) --> 10 11 12 13 14 ...
  • islice: islice returns a lazy slice of an iterable. You can specify where to start, where to stop, and the step. For example:
islice('ABCDEFG', 2, None, 1) --> C D E F G
  • tee: tee splits one iterator into n independent iterators (two by default). Its signature is:
tee(it, n=2)
  • repeat: repeat yields an element a given number of times. If you omit the number, it repeats the element forever. For example:
repeat(10, 3) --> 10 10 10
  • cycle: cycle repeats the elements of a sequence over and over. For example:
cycle('ABCD') --> A B C D A B C D ...
  • chain: chain takes several sequences as input and goes over the elements of each sequence in turn. For example:
chain('ABC', 'DEF') --> A B C D E F
  • accumulate: accumulate takes a sequence and a two-argument function, add by default, and generates the running totals. For example:
accumulate([1, 2, 3, 4, 5], add) --> 1 3 6 10 15
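The definitions above can be tried out directly. The following sketch runs each function once; the infinite ones (count and cycle) are cut off with islice so that list() terminates, and the variable names are only for illustration.

```python
from itertools import accumulate, chain, count, cycle, islice, repeat, tee
from operator import add

# count and cycle never stop on their own, so islice limits them.
first_counts = list(islice(count(10, 1), 5))
cycled = list(islice(cycle('ABCD'), 8))

sliced = list(islice('ABCDEFG', 2, None, 1))

# tee gives two independent iterators over the same underlying data.
left, right = tee(iter([1, 2, 3]))
left_items, right_items = list(left), list(right)

repeated = list(repeat(10, 3))
chained = list(chain('ABC', 'DEF'))
totals = list(accumulate([1, 2, 3, 4, 5], add))

print(first_counts, sliced, repeated)
print(cycled, chained, totals)
```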

As we said, to make something out of these primitive functions, we would have to compose them and create higher abstractions. So let’s do that and create our itertools extensions.

Custom extensions

To begin with, let’s create a function to realize the first n elements of a sequence, for instance, a generator.

from itertools import islice

def take(it, n):
    return list(islice(it, n))

As we do that, let’s also create a function that removes the first n elements of a sequence.

def drop(it, n):
    return islice(it, n, None)

Now, we are able to create our own primitive functions, that return the head and tail of a sequence.

head = next
tail = partial(drop, n=1)

If you haven’t used partial before, you can find it in the functools namespace. In this example, it takes the drop function, passes 1 as the argument n, and returns a new function that expects the it argument. Thus, if you pass a list to the tail function, it chops off the first element and returns a lazy iterator over the remaining items. Note that head is just next, so it expects an iterator; a list has to be wrapped in iter() first.
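A short usage sketch of head and tail on a plain list (the list itself is just an example):

```python
from functools import partial
from itertools import islice


def drop(it, n):
    return islice(it, n, None)


head = next
tail = partial(drop, n=1)

numbers = [1, 2, 3, 4]

first = head(iter(numbers))   # next() needs an iterator, so wrap the list
rest = list(tail(numbers))    # tail is lazy; list() realizes the result

print(first, rest)
```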

Next, we want to create a new function, that we will call compose. It works like that:

compose(x, f) --> x, f(x), f(f(x)), f(f(f(x))), ... 

So, how can we do that? It turns out we can do this very efficiently now that we have the power of the repeat and accumulate functions at our disposal:

def compose(x, f):
    return accumulate(repeat(x), lambda acc, x: f(acc))

I admit it looks a bit complex, but if you squint hard enough, you’ll understand its inner workings. repeat generates an infinite sequence of xs and then, accumulate takes each of them and calculates the running total according to some function f that we define. So, the lambda function, takes as inputs the previously accumulated value, and a new x that we ignore, and passes the accumulated value through the function, again and again. To test it, we could do something like that:

take(compose(2, lambda x: x**2), 5) --> [2, 4, 16, 256, 65536]
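Since compose only needs a starting value and a one-argument function, it works for any "apply the same step again and again" sequence. As a sketch, here are two more uses; collatz_step is a helper made up for this example.

```python
from itertools import accumulate, islice, repeat


def take(it, n):
    return list(islice(it, n))


def compose(x, f):
    # The second lambda argument is the repeated x, which is ignored.
    return accumulate(repeat(x), lambda acc, _: f(acc))


# Doubling again and again gives the powers of two.
powers_of_two = take(compose(1, lambda x: 2 * x), 6)


def collatz_step(n):
    # Hypothetical helper: halve even numbers, map odd n to 3n + 1.
    return n // 2 if n % 2 == 0 else 3 * n + 1


collatz_from_6 = take(compose(6, collatz_step), 9)

print(powers_of_two)
print(collatz_from_6)
```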

Putting it all together

Now, we are onto something here! So let’s use some of the functions that we have defined to create the famous Fibonacci sequence.

def next_num(pair):
    x, y = pair
    return y, x + y

def fibonacci():
    return (y for x, y in compose((0, 1), next_num))

Let’s walk through it step by step. The compose function creates a sequence of tuples. The first tuple is (0, 1). Thus, the fibonacci function yields the y value of the tuple, which is 1. It then passes the tuple (0, 1) to the next_num function, which returns 1 and 0 + 1 = 1. So, the fibonacci function also yields 1. Finally, the compose function passes the tuple (1, 1) to the next_num function, which returns 1 and 1 + 1 = 2. The fibonacci function in turn yields 2.

take(fibonacci(), 10) --> [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]

This is a super-efficient implementation of the Fibonacci sequence, which runs in under 35μs! It showcases the power of good, functional design!
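To check the functional version, it can be compared against a plain generator. The sketch below repeats the definitions so that it runs on its own; fibonacci_plain is a hypothetical reference implementation added for the comparison, and the timing it prints depends on your machine.

```python
import timeit
from itertools import accumulate, islice, repeat


def compose(x, f):
    return accumulate(repeat(x), lambda acc, _: f(acc))


def next_num(pair):
    x, y = pair
    return y, x + y


def fibonacci():
    return (y for x, y in compose((0, 1), next_num))


def fibonacci_plain():
    # Hypothetical reference: the same sequence as an explicit loop.
    x, y = 0, 1
    while True:
        yield y
        x, y = y, x + y


functional = list(islice(fibonacci(), 10))
plain = list(islice(fibonacci_plain(), 10))
print(functional == plain, functional)

# Average time for the first 10 numbers, in seconds (machine dependent).
seconds = timeit.timeit(lambda: list(islice(fibonacci(), 10)), number=1000) / 1000
print(seconds)
```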

Develop and sell a Python API — from start to end tutorial

posted Sep 20, 2020, 3:05 PM by Chris G   [ updated Sep 20, 2020, 3:08 PM ]

From: https://towardsdatascience.com/develop-and-sell-a-python-api-from-start-to-end-tutorial-9a038e433966

You can also read this article directly on Github (for better code formatting)

The article paints a picture for developing a Python API from start to end and provides help in more difficult areas.

I recently read a blog post about setting up your own API and selling it.

I was quite inspired and wanted to test if it works. In just 5 days I was able to create an API from start to end. So I thought I would share the issues I came across, elaborate on concepts that the article introduced, and provide a quick checklist to build something yourself. All of this by developing another API.

About this article

This article can be considered a tutorial and a summary of other articles (listed in my “Inspiration” section).

It paints a picture for developing a Python API from start to finish and provides help in more difficult areas like the setup with AWS and Rapidapi.

I thought it would be useful for other people trying to do the same. I had some issues on the way, so I thought I would share my approach. It is also a great way to build side projects and maybe even make some money.

It consists of 4 major parts, namely:

  1. Setting up the environment
  2. Creating a problem solution with Python
  3. Setting up AWS
  4. Setting up Rapidapi

You will find all my code open sourced on Github:

You will find the end result here on Rapidapi:

If you found this article helpful let me know and/or buy the functionality on Rapidapi to show support.

Disclaimer

I am not associated with any of the services I use in this article.

I do not consider myself an expert. If you have the feeling that I am missing important steps or neglected something, consider pointing it out in the comment section or get in touch with me. Also, always make sure to monitor your AWS costs to not pay for things you do not know about.

I am always happy to receive constructive input and suggestions on how to improve.

Stack used

We will use

  • Github (Code hosting),
  • Anaconda (Dependency and environment management),
  • Jupyter Notebook (code development and documentation),
  • Python (programming language),
  • AWS (deployment),
  • Rapidapi (market to sell)

1. Create project formalities

It’s always the same but necessary. I do it along with these steps:

  1. Create a local folder: mkdir NAME
  2. Create a new repository on Github with NAME
  3. Create a conda environment: conda create --name NAME python=3.7
  4. Activate the conda environment: conda activate PATH_TO_ENVIRONMENT
  5. Create a git repo: git init
  6. Connect to the Github repo. Add a Readme file, commit it, and run:
git remote add origin URL_TO_GIT_REPO
git push -u origin master

Now we have:

  • local folder
  • github repository
  • anaconda virtual environment
  • git version control

2. Create a solution for a problem

Then we need to create a solution to some problem. For the sake of demonstration, I will show how to convert a CSV file (for example, one exported from Excel) into other formats. The basic functionality will be coded and tested in a Jupyter Notebook first.

Install packages

Install jupyter notebook and jupytext:

pip install notebook jupytext

Then set a hook in .git/hooks/pre-commit so that the notebook changes are tracked properly in git:

#!/bin/sh
jupytext --from ipynb --to jupytext_conversion//py:light --pre-commit

Develop a solution to a problem

pip install pandas requests

Add a .gitignore file and add the data folder (data/) to not upload the data to the hosting.

Download data

Download an example dataset (titanic dataset) and save it into a data folder:

import os

import requests


def download(url: str, dest_folder: str):
    # Create the destination folder if it does not exist yet
    if not os.path.exists(dest_folder):
        os.makedirs(dest_folder)

    filename = url.split('/')[-1].replace(" ", "_")
    file_path = os.path.join(dest_folder, filename)

    r = requests.get(url, stream=True)
    if r.ok:
        print("saving to", os.path.abspath(file_path))
        # Write the response to disk chunk by chunk
        with open(file_path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024 * 8):
                if chunk:
                    f.write(chunk)
                    f.flush()
                    os.fsync(f.fileno())
    else:
        print("Download failed: status code {}\n{}".format(r.status_code, r.text))


url_to_titanic_data = 'https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv'
download(url_to_titanic_data, './data')

Create functionality

Transform format

import pandas as pd

df = pd.read_csv('./data/titanic.csv')
df.to_json(r'./data/titanic.json')

[Figure: Conversion example in Jupyter Notebook]
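The shape of the JSON depends on the orient parameter of to_json. As a small sketch on a made-up two-row CSV (not the titanic data), the default orient groups the values by column, while orient="records" gives one object per row:

```python
import io
import json

import pandas as pd

# Hypothetical two-row CSV used only for this illustration.
csv_text = "Name,Age\nAnna,30\nBen,25\n"
toy = pd.read_csv(io.StringIO(csv_text))

# Default orient ("columns"): {column -> {row label -> value}}
by_column = json.loads(toy.to_json())

# orient="records": a list with one dict per row
by_record = json.loads(toy.to_json(orient="records"))

print(by_column)
print(by_record)
```

Which layout is better depends on the consumer of the API; the records layout is often easier to iterate over on the client side.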

Build server to execute a function with REST

After developing the functionality in a Jupyter notebook, we want to actually provide the functionality in a Python app.

There are ways to use parts of the Jupyter notebook, but for the sake of simplicity we create it again now.

Add an app.py file.

We want the user to upload a file and get it back converted into JSON, for example.

Browsing through the internet we can see that there are already packages that work with flask and excel formats. So let's use them.

pip install Flask

Start Flask server with

env FLASK_APP=app.py FLASK_ENV=development flask run

Tip: Test your backend functionality with Postman. It is easy to set up and allows us to test the backend functionality quickly. Uploading a file is done in the “form-data” tab:

[Figure: Testing backend with Postman]

Here you can see the uploaded titanic csv file and the returned column names of the dataset.

Now we simply write the function to transform the uploaded file into json, like:

import pandas as pd
from flask import Flask, request

app = Flask(__name__)


@app.route('/get_json', methods=['GET', 'POST'])
def upload_file():
    if request.method == 'POST':
        provided_data = request.files.get('file')
        if provided_data is None:
            return 'Please provide a valid file', 400
        df = pd.read_csv(provided_data)
        transformed = df.to_json()
        # Flask (1.1+) turns a returned dict into a JSON response
        return {'result': transformed}
    # A plain GET carries no file, so explain how to use the endpoint
    return 'Send a POST request with a CSV file in the "file" form field.'


if __name__ == '__main__':
    app.run()

(Check out my repository for the full code.)

Now we have the functionality to transform csv files into json for example.
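Besides Postman, the endpoint can also be exercised from Python with Flask's built-in test client. The following is a sketch under assumptions: it redefines a stripped-down, POST-only version of the route instead of importing app.py, reads the upload into memory first, and uses a made-up two-row CSV.

```python
import io
import json

import pandas as pd
from flask import Flask, request

app = Flask(__name__)


@app.route('/get_json', methods=['POST'])
def upload_file():
    provided_data = request.files.get('file')
    if provided_data is None:
        return 'Please provide a valid file', 400
    # Read the uploaded bytes into memory before handing them to pandas.
    df = pd.read_csv(io.BytesIO(provided_data.read()))
    return {'result': df.to_json()}


# The test client sends requests to the app without starting a server.
client = app.test_client()

# Hypothetical CSV, uploaded in the "file" form field like in Postman.
upload = {'file': (io.BytesIO(b"Name,Age\nAnna,30\nBen,25\n"), 'tiny.csv')}
ok_response = client.post('/get_json', data=upload,
                          content_type='multipart/form-data')
converted = json.loads(ok_response.get_json()['result'])

# A request without a file should be rejected with status 400.
bad_response = client.post('/get_json')

print(ok_response.status_code, converted)
print(bad_response.status_code)
```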

3. Deploy to AWS

After developing it locally we want to get it in the cloud.

Set up zappa

After we created the app locally we need to start setting up the hosting on a real server. We will use zappa.

Zappa makes it super easy to build and deploy server-less, event-driven Python applications (including, but not limited to, WSGI web apps) on AWS Lambda + API Gateway. Think of it as “serverless” web hosting for your Python apps. That means infinite scaling, zero downtime, zero maintenance — and at a fraction of the cost of your current deployments!

pip install zappa

As we are using a conda environment, we need to specify it. Running

which python

will give you /Users/XXX/opt/anaconda3/envs/XXXX/bin/python (for Mac).

Remove the trailing bin/python and export the rest:

export VIRTUAL_ENV=/Users/XXXX/opt/anaconda3/envs/XXXXX/

Now we can do

zappa init

to set up the config.

Just click through everything and you will have a zappa_settings.json like

{
    "dev": {
        "app_function": "app.app",
        "aws_region": "eu-central-1",
        "profile_name": "default",
        "project_name": "pandas-transform-format",
        "runtime": "python3.7",
        "s3_bucket": "zappa-pandas-transform-format"
    }
}

Note that we are not yet ready to deploy. First, we need to get some AWS credentials.

Set up AWS

AWS credentials

First, you need to get an AWS access key ID and secret access key.

You might think it is as easy as this:

  • Go to: http://aws.amazon.com/
  • Sign up and create a new account (they’ll give you the option for 1 year trial or similar)
  • Go to your AWS account overview
  • Open the account menu, sub-menu: Security Credentials

But no. There is more to permissions in AWS!

Set up credentials with users and roles in IAM

I found this article from Peter Kazarinoff to be very helpful. He explains the next section in great detail. My following bullet point approach is a quick summary and I often quote his steps. Please check out his article for more details if you are stuck somewhere.

I break it down as simple as possible:

  1. Within the AWS Console, type IAM into the search box. IAM is the AWS user and permissions dashboard.
  2. Create a group
  3. Give your group a name (for example zappa_group)
  4. Create our own specific inline policy for your group
  5. In the Permissions tab, under the Inline Policies section, choose the link to create a new Inline Policy
  6. In the Set Permissions screen, click the Custom Policy radio button and click the “Select” button on the right.
  7. Create a Custom Policy written in json format
  8. Read through and copy a policy discussed here: https://github.com/Miserlou/Zappa/issues/244
  9. Scroll down to “My Custom policy” to see a snippet of my policy.
  10. After pasting and modifying the json with your AWS Account Number, click the “Validate Policy” button to ensure you copied valid json. Then click the “Apply Policy” button to attach the inline policy to the group.
  11. Create a user and add the user to the group
  12. Back at the IAM Dashboard, create a new user with the “Users” left-hand menu option and the “Add User” button.
  13. In the Add user screen, give your new user a name and select the Access Type for Programmatic access. Then click the “Next: Permissions” button.
  14. In the Set permissions screen, select the group you created earlier in the Add user to group section and click “Next: Tags”.
  15. Tags are optional. Add tags if you want, then click “Next: Review”.
  16. Review the user details and click “Create user”
  17. Copy the user’s keys
  18. Don’t close the AWS IAM window yet. In the next step, you will copy and paste these keys into a file. At this point, it’s not a bad idea to copy and save these keys into a text file in a secure location. Make sure you don’t save keys under version control.

My Custom policy:

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"iam:AttachRolePolicy",
"iam:GetRole",
"iam:CreateRole",
"iam:PassRole",
"iam:PutRolePolicy"
],
"Resource": [
"arn:aws:iam::XXXXXXXXXXXXXXXX:role/*-ZappaLambdaExecutionRole"
]
},
{
"Effect": "Allow",
"Action": [
"lambda:CreateFunction",
"lambda:ListVersionsByFunction",
"logs:DescribeLogStreams",
"events:PutRule",
"lambda:GetFunctionConfiguration",
"cloudformation:DescribeStackResource",
"apigateway:DELETE",
"apigateway:UpdateRestApiPolicy",
"events:ListRuleNamesByTarget",
"apigateway:PATCH",
"events:ListRules",
"cloudformation:UpdateStack",
"lambda:DeleteFunction",
"events:RemoveTargets",
"logs:FilterLogEvents",
"apigateway:GET",
"lambda:GetAlias",
"events:ListTargetsByRule",
"cloudformation:ListStackResources",
"events:DescribeRule",
"logs:DeleteLogGroup",
"apigateway:PUT",
"lambda:InvokeFunction",
"lambda:GetFunction",
"lambda:UpdateFunctionConfiguration",
"cloudformation:DescribeStacks",
"lambda:UpdateFunctionCode",
"lambda:DeleteFunctionConcurrency",
"events:DeleteRule",
"events:PutTargets",
"lambda:AddPermission",
"cloudformation:CreateStack",
"cloudformation:DeleteStack",
"apigateway:POST",
"lambda:RemovePermission",
"lambda:GetPolicy"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"s3:ListBucketMultipartUploads",
"s3:CreateBucket",
"s3:ListBucket"
],
"Resource": "arn:aws:s3:::zappa-*"
},
{
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:GetObject",
"s3:AbortMultipartUpload",
"s3:DeleteObject",
"s3:ListMultipartUploadParts"
],
"Resource": "arn:aws:s3:::zappa-*/*"
}
]
}

NOTE: Replace the XXXXXXXXXXXXXXXX placeholder in the inline policy with your AWS Account Number.

Your AWS Account Number can be found by clicking “Support” → “Support Center”. Your Account Number is listed in the Support Center on the upper left-hand side. The json above is what finally worked for me, but I expect this set of security permissions may be too open. To increase security, you could slowly pare down the permissions and see if Zappa still deploys. You can dig through this discussion on GitHub if you want to learn more about the specific AWS permissions needed to run Zappa: https://github.com/Miserlou/Zappa/issues/244.

Add credentials in your project

Create a credentials file inside a .aws folder in your home directory with

mkdir ~/.aws
code open ~/.aws/credentials

and paste your credentials from AWS

[dev]
aws_access_key_id = YOUR_KEY
aws_secret_access_key = YOUR_KEY

Same with the config

code open ~/.aws/config

[default]
region = YOUR_REGION (eg. eu-central-1)

Note that code is for opening a folder with vscode, my editor of choice.

Save the AWS access key id and secret access key assigned to the user you created in the file ~/.aws/credentials. Note the .aws/ directory needs to be in your home directory and the credentials file has no file extension.
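As a quick sanity check before deploying, the profile layout can be parsed with Python's standard configparser module. This is only a minimal sketch: the sample text and the has_profile helper are made up for illustration, and in practice you would read the contents of ~/.aws/credentials instead of a hard-coded string.

```python
import configparser

# Sample contents mirroring the layout of ~/.aws/credentials (placeholder values).
SAMPLE_CREDENTIALS = """
[dev]
aws_access_key_id = YOUR_KEY
aws_secret_access_key = YOUR_SECRET
"""

REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key")


def has_profile(text, profile):
    """Hypothetical helper: True if `profile` exists and defines both keys."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    if not parser.has_section(profile):
        return False
    return all(parser.has_option(profile, key) for key in REQUIRED_KEYS)


print(has_profile(SAMPLE_CREDENTIALS, "dev"))      # True
print(has_profile(SAMPLE_CREDENTIALS, "staging"))  # False
```

If the check fails for the profile name you pass to zappa, the section header in the file is the first thing to look at.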

Now you can deploy your API with

zappa deploy dev

Deploying app with zappa

There shouldn’t be any errors anymore. However, if there are still some, you can debug with:

zappa status
zappa tail

The most common errors are permission related (then check your permission policy) or about python libraries that are incompatible. Either way, zappa will provide good enough error messages for debugging.

If you update your code don’t forget to update the deployment as well with

zappa update dev

AWS API Gateway

To set up the API on a market we need to first restrict its usage with an API-key and then set it up on the market platform.

I found this article from Nagesh Bansal to be helpful. He explains the next section in great detail. My following bullet point approach is a quick summary and I often quote his steps. Please check out his article for more details if you are stuck somewhere.

Again, I break it down:

  1. go to your AWS Console and go to API gateway
  2. click on your API
  3. we want to create an x-api-key to restrict undesired access to the API and also have a metered usage
  4. create a Usage plan for the API, with the desired throttle and quota limits
  5. create an associated API stage
  6. add an API key
  7. in the API key overview section, click “show” at the API key and copy it
  8. then associate the API with the key and discard all requests that come without the key
  9. go back to the API overview. Under resources, click “/ ANY”, go to “Method Request”, and in the settings set “API key required” to true
  10. do the same for the “/{proxy+} Methods”

it looks like this

Set restrictions in AWS API Gateway

Now you have restricted access to your API.

4. Set up Rapidapi

Create API on Rapidapi

  1. Go to “My APIs” and “Add new API”
  2. Add the name, description, and category. Note that you cannot change your API name afterward anymore
  3. In settings, add the URL of your AWS API (it was displayed when you deployed with zappa)
  4. In the section “Access Control” under “Transformations”, add the API key you added in AWS

Access Control in Rapidapi

5. In the security tab you can check everything

6. Then go to “endpoints” to add the routes from your Python app by clicking “create REST endpoint”

Add a REST endpoint

7. Add an image for your API

8. Set a pricing plan. Rapidapi has published its own article on pricing options and strategies. As they conclude, how you price your API depends on your preferences and your product.

9. I created a freemium pricing plan. The reason for that is that I want to give the chance to test it without cost, but add a price for using it regularly. Also, I want to create a plan for supporting my work. For example:

Set price plans

10. Create some docs and a tutorial. This is pretty self-explanatory. It is encouraged to do so, as it is easier for people to use your API if it is documented properly.

11. The last step is to make your API publicly available. But before you do that it is useful to test it for yourself.

Test your own API

Create a private plan for testing

Having set up everything, you of course should test it with the provided snippets. This step is not trivial and I had to contact the support to understand it. Now I am simplifying it here.

Create a private plan for yourself, by setting no limits.

Then go to the “Users” section of your API, then to “Users on free plans”, select yourself, and “invite” yourself to the private plan.

Add yourself to your private plan

Now you are subscribed to your own private plan and can test the functionality with the provided snippets.

Test endpoint with Rapidapi

Upload an example excel file and click on “test endpoint”. Then you will get a 200 ok response.

Test an endpoint in Rapidapi

Create code to consume API

To consume the API now you can simply copy the snippet that Rapidapi provides. For example with Python and the requests library:

import requests

url = "https://excel-to-other-formats.p.rapidapi.com/upload"
payload = ""
headers = {
    'x-rapidapi-host': "excel-to-other-formats.p.rapidapi.com",
    'x-rapidapi-key': "YOUR_KEY",
    'content-type': "multipart/form-data"
}
response = requests.request("POST", url, data=payload, headers=headers)
print(response.text)

End result

Inspiration

The article “API as a product. How to sell your work when all you know is a back-end” by Artem provided a great idea, namely to

Make an API that solves a problem

Deploy it with a serverless architecture

Distribute through an API Marketplace

For setting everything up, I found the articles from Nagesh Bansal very helpful:

Also this article from Peter Kazarinoff: https://pythonforundergradengineers.com/deploy-serverless-web-app-aws-lambda-zappa.html

I encourage you to have a look at those articles as well.

You can also read my article directly on Github (for better code formatting)


The right and wrong way to set Python 3 as default on a Mac

posted Aug 9, 2020, 10:11 AM by Chris G   [ updated Aug 9, 2020, 10:12 AM ]

There are several ways to get started with Python 3 on macOS, but one way is better than the others.

What's so hard about this?

The version of Python that ships with macOS is well out of date from what Python recommends using for development. Python runtimes are also comically challenging at times, as noted by XKCD.


Python environment webcomic by xkcd

So what's the plan? 

Python GUI For Humans! Create full function Python interfaces with PySimpleGUI

posted Aug 9, 2020, 10:04 AM by Chris G   [ updated Aug 9, 2020, 10:06 AM ]


$ pip install pysimplegui

GitHub https://github.com/PySimpleGUI/
Docs https://pysimplegui.readthedocs.io/en/latest/

pysimplegui_logo

This Code

import PySimpleGUI as sg

sg.theme('DarkAmber')   # Add a touch of color
# All the stuff inside your window.
layout = [  [sg.Text('Some text on Row 1')],
            [sg.Text('Enter something on Row 2'), sg.InputText()],
            [sg.Button('Ok'), sg.Button('Cancel')] ]

# Create the Window
window = sg.Window('Window Title', layout)
# Event Loop to process "events" and get the "values" of the inputs
while True:
    event, values = window.read()
    if event == sg.WIN_CLOSED or event == 'Cancel': # if user closes window or clicks cancel
        break
    print('You entered ', values[0])

window.close()

Makes This Window

and returns the value input as well as the button clicked.

image

Activate link to view larger image.

8 Advanced Tips to Master Python Strings

posted Jul 4, 2020, 12:27 PM by Chris G   [ updated Jul 4, 2020, 12:35 PM ]



Learn These Tips to Master Python Strings

Python strings appear simple, but they're incredibly flexible and they’re everywhere!

It may not seem like strings are something to master for data science, but with the abundance of unstructured, qualitative data available, it’s incredibly helpful to dive into strings!

1. Check for Membership with ‘in’

When working with unstructured data, it can be really helpful to identify particular words or other substrings in a larger string. The easiest way to do this is by using the in operator.

Say you’re working with a list, series, or dataframe column, and you want to identify whether a substring exists in a string.

In the example below, you have a list of different regions and want to know if the string “West” is in each list item.


sample_list = ['North West', 'West', 'North East', 'East', 'South', 'North']
is_west = ['Yes' if 'West' in location else 'No' for location in sample_list]
print(is_west)

# Returns:
# ['Yes', 'Yes', 'No', 'No', 'No', 'No']


Checking string membership. Source: Nik Piepenbreier
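One thing to keep in mind is that the in operator is case-sensitive. The short sketch below uses a made-up list of regions to show the difference between an exact check and a check on lower-cased text.

```python
# Hypothetical sample data with inconsistent capitalization.
regions = ['North West', 'west', 'North East', 'WEST coast']

# `in` is case-sensitive, so only exact-case matches are found.
exact = ['West' in region for region in regions]

# Lower-casing the text first gives a case-insensitive check.
relaxed = ['west' in region.lower() for region in regions]

print(exact)    # [True, False, False, False]
print(relaxed)  # [True, True, False, True]
```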

2. Do Magic with F-Strings

F-strings were introduced in Python 3.6 and they don’t get enough credit.

There’s a reason I say they’re magic. They:

  • Allow for much more flexibility,
  • Are much more readable than other methods, and
  • Execute much faster.

But what are they? F-strings (or formatted string literals) allow you to place variables (or any expression) into strings. The expressions are then executed at run time.

To write an f-string, prefix a string with ‘f’.

Let’s take a look at an example:


name = 'Nik'
birthyear = 1987
print(f'My name is {name} and I am {2020-birthyear} years old.')


F-strings are amazing. Source: Nik Piepenbreier 
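F-strings also accept a format specification after a colon, which is handy for numbers. Here is a small sketch; the variable names and values are just for illustration.

```python
price = 1234.5678
ratio = 0.256
name = 'Nik'

formatted_price = f'{price:,.2f}'   # thousands separator, two decimals
formatted_ratio = f'{ratio:.1%}'    # percentage with one decimal
padded_name = f'{name:>6}'          # right-aligned in a field of width 6

print(formatted_price)  # 1,234.57
print(formatted_ratio)  # 25.6%
print(repr(padded_name))  # '   Nik'
```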

3. Reverse a String with [::-1]

Strings can be reversed (like other sequences) by slicing. To reverse any sequence, such as a string, list, or tuple, simply use [::-1].

The -1 is the step argument: with a negative step, Python starts at the last value and moves backward one item at a time:


string = 'pythonisfun'
print(string[::-1])

# Returns: nufsinohtyp


Reversing a string. Source: Nik Piepenbreier

4. Replace Substrings with .replace()

To replace substrings, you can use the replace method. This works for any substring, including a simple space (Python’s .strip() only trims whitespace from the ends of a string, so .replace() is the usual way to remove spaces inside it).

Let’s take a look at an example:


sample = 'Python is kind of fun.'
print(sample.replace('kind of', 'super'))

# Returns:
# Python is super fun.


Replacing substrings. Source: Nik Piepenbreier
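The replace method also takes an optional third argument that limits how many occurrences are replaced. A quick sketch with a made-up string:

```python
sample = 'a b a b a'

first_two = sample.replace('a', 'x', 2)   # replace only the first two matches
no_spaces = sample.replace(' ', '')       # remove every space
trimmed = '  padded  '.strip()            # strip() only trims the ends

print(first_two)  # x b x b a
print(no_spaces)  # ababa
print(trimmed)    # padded
```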

5. Iterating over a String with a For-Loop

Python strings are iterable objects (just like lists, sets, etc.).

If you wanted to return each letter of a string, you could write:


sample = 'python'
for letter in sample:
    print(letter)

# Returns:
# p
# y
# t
# h
# o
# n


6. Format Strings with .upper(), .lower(), and .title()

Python strings can be a little quirky. You might get yourself a file in all caps, all lower cases, etc. And you might need to actually format these for presenting them later on.

  • .upper() will return a string with all characters in upper case
  • .lower() will return a string with all characters in lower case
  • .title() will capitalize each word of a string.

Let’s see these in action:


sample = 'THIS is a StRiNg'
print(sample.upper())
print(sample.lower())
print(sample.title())

# Returns:
# THIS IS A STRING
# this is a string
# This Is A String


7. Check for Palindromes and Anagrams

Combining what you’ve learned so far, you can easily check if a string is a Palindrome by using the [::-1] slice.

A word or phrase is a palindrome if it’s the same spelled forward as it is backward.

Similarly, you can return a sorted version of a string by using the sorted function. If two sorted strings are the same, they are anagrams:


string = 'taco cat'

def palindrome(string_to_check):
    # Normalize the argument once: ignore case and spaces
    cleaned = string_to_check.lower().replace(' ', '')
    if cleaned == cleaned[::-1]:
        print("You found a palindrome!")
    else:
        print("Your string isn't a palindrome")

palindrome(string)

# Returns:
# You found a palindrome!


An anagram is a word or phrase that is formed by rearranging another word. In short, two words are anagrams if they have the same letters.

If you want to see if two words are anagrams, you can sort the two words and see if they are the same:


def anagram(word1, word2):
    if sorted(word1) == sorted(word2):
        print(f"{word1} and {word2} are anagrams!")
    else:
        print(f"{word1} and {word2} aren't anagrams!")

anagram('silent', 'listen')

# Returns:
# silent and listen are anagrams!
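If you want to reuse these checks in other code, it can help to return a boolean instead of printing. The sketch below is one possible variant, not the original author's code: the helper names are made up, and it swaps in collections.Counter for the sorted comparison so that case, spaces, and punctuation are ignored.

```python
from collections import Counter


def normalize(text):
    """Lower-case the text and keep only letters and digits."""
    return [ch for ch in text.lower() if ch.isalnum()]


def is_palindrome(text):
    chars = normalize(text)
    return chars == chars[::-1]


def is_anagram(first, second):
    # Two texts are anagrams if every character occurs equally often in both.
    return Counter(normalize(first)) == Counter(normalize(second))


print(is_palindrome('Taco cat'))              # True
print(is_anagram('Dormitory', 'dirty room'))  # True
print(is_anagram('silent', 'listens'))        # False
```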


8. Split a String with .split()

Say you’re given a string that contains multiple pieces of data. It can be helpful to split this string to parse out individual pieces of data.

In the example below, a string contains the region, the last name of a sales rep, as well as an order number.

You can use .split() to split these values:


order_text = 'north-doe-001'
print(order_text.split('-'))

# Returns:
# ['north', 'doe', '001']
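Since .split() returns a list, you can unpack the pieces straight into variables. The optional maxsplit argument limits the number of splits, which helps when the last field may itself contain the separator. A small sketch (the second record is a made-up example):

```python
order_text = 'north-doe-001'

# Unpack the three fields directly into named variables.
region, rep, order_number = order_text.split('-')
print(region, rep, int(order_number))  # north doe 1

# With maxsplit=2, everything after the second hyphen stays in one piece.
record = 'north-doe-001-rush-order'
parts = record.split('-', 2)
print(parts)  # ['north', 'doe', '001-rush-order']
```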



Geo Heatmap

posted Jul 4, 2020, 12:15 PM by Chris G   [ updated Jul 4, 2020, 12:15 PM ]


screenshot

This is a script that generates an interactive geo heatmap from your Google location history data using Python, Folium and OpenStreetMap.




8 Advanced Python Tricks Used by Seasoned Programmers

posted Jun 27, 2020, 10:10 AM by Chris G   [ updated Jun 27, 2020, 1:36 PM ]

Apply these tricks in your Python code to make it more concise and performant

Here are eight neat Python tricks, some of which I’m sure you haven’t seen before. Apply these tricks in your Python code to make it more concise and performant!

1. Sorting Objects by Multiple Keys

Suppose we want to sort the following list of dictionaries:

people = [
{ 'name': 'John', "age": 64 },
{ 'name': 'Janet', "age": 34 },
{ 'name': 'Ed', "age": 24 },
{ 'name': 'Sara', "age": 64 },
{ 'name': 'John', "age": 32 },
{ 'name': 'Jane', "age": 34 },
{ 'name': 'John', "age": 99 },
]

But we don’t just want to sort it by name or age, we want to sort it by both fields. In SQL, this would be a query like:

SELECT * FROM people ORDER by name, age

There’s actually a very simple solution to this problem, thanks to Python’s guarantee that sort functions offer a stable sort order. This means items that compare equal retain their original order.

To achieve sorting by name and age, we can do this:

import operator
people.sort(key=operator.itemgetter('age'))
people.sort(key=operator.itemgetter('name'))

Notice how I reversed the order. We first sort by age, and then by name. With operator.itemgetter() we get the age and name fields from each dictionary inside the list in a concise way.

This gives us the result we were looking for:

[
{'name': 'Ed', 'age': 24},
{'name': 'Jane', 'age': 34},
{'name': 'Janet','age': 34},
{'name': 'John', 'age': 32},
{'name': 'John', 'age': 64},
{'name': 'John', 'age': 99},
{'name': 'Sara', 'age': 64}
]

The names are sorted primarily, the ages are sorted if the name is the same. So all the Johns are grouped together, sorted by age.

Inspired by this StackOverflow question.
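As an alternative to sorting twice, operator.itemgetter() accepts several keys and then returns a tuple, so the same ordering can be obtained in a single pass. The sketch below uses a shortened version of the list; the lambda variant for a descending age is an extra illustration, not part of the original trick.

```python
import operator

people = [
    {'name': 'John', 'age': 64},
    {'name': 'Janet', 'age': 34},
    {'name': 'Ed', 'age': 24},
    {'name': 'John', 'age': 32},
]

# Tuples compare element by element: first by name, then by age.
by_name_age = sorted(people, key=operator.itemgetter('name', 'age'))
print([(p['name'], p['age']) for p in by_name_age])
# [('Ed', 24), ('Janet', 34), ('John', 32), ('John', 64)]

# Name ascending, age descending: negate the numeric field in the key.
mixed = sorted(people, key=lambda p: (p['name'], -p['age']))
print([(p['name'], p['age']) for p in mixed])
# [('Ed', 24), ('Janet', 34), ('John', 64), ('John', 32)]
```

The two-step stable sort is still useful when each key needs a different direction and the values cannot simply be negated, such as strings.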


2. List Comprehensions

A list comprehension can replace ugly for loops used to fill a list. The basic syntax for a list comprehension is:

[ expression for item in list if conditional ]

A very basic example to fill a list with a sequence of numbers:

mylist = [i for i in range(10)]
print(mylist)
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

And because you can use an expression, you can also do some math:

squares = [x**2 for x in range(10)]
print(squares)
# [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

Or even call an external function:

def some_function(a):
    return (a + 5) / 2
    
my_formula = [some_function(i) for i in range(10)]
print(my_formula)
# [2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0]

And finally, you can use the ‘if’ to filter the list. In this case, we only keep the values that are divisible by 2:

filtered = [i for i in range(20) if i%2==0]
print(filtered)
# [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
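To make the connection with a regular loop explicit, here is a short sketch that builds the same list both ways, combining an expression with a filter:

```python
# Loop version: filter first, then transform and append.
even_squares_loop = []
for i in range(10):
    if i % 2 == 0:
        even_squares_loop.append(i ** 2)

# Equivalent list comprehension: expression, loop, and condition in one line.
even_squares = [i ** 2 for i in range(10) if i % 2 == 0]

print(even_squares)                       # [0, 4, 16, 36, 64]
print(even_squares == even_squares_loop)  # True
```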

3. Check memory usage of your objects

With sys.getsizeof() you can check the memory usage of an object:

import sys

mylist = range(0, 10000)
print(sys.getsizeof(mylist))
# 48

Woah… wait… why is this huge list only 48 bytes?

It’s because the range function returns a range object that only behaves like a list: it stores just the start, stop, and step values and computes each number on demand. A range is a lot more memory efficient than an actual list of numbers.

You can see for yourself by using a list comprehension to create an actual list of numbers from the same range:

import sys

myreallist = [x for x in range(0, 10000)]
print(sys.getsizeof(myreallist))
# 87632

So, by playing around with sys.getsizeof() you can learn more about Python and your memory usage.
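Two caveats are worth knowing. The exact numbers depend on your Python version and platform, and sys.getsizeof() is shallow: it measures the container itself, not the objects it refers to. A small sketch that compares sizes instead of relying on specific byte counts:

```python
import sys

numbers_range = range(0, 10000)
numbers_list = list(numbers_range)

# The range object is far smaller than the materialized list.
range_is_smaller = sys.getsizeof(numbers_range) < sys.getsizeof(numbers_list)
print(range_is_smaller)  # True

# getsizeof is shallow: a list holding one huge string is itself tiny,
# because only the reference to the string is counted.
big_string = 'x' * 1_000_000
wrapper = [big_string]
wrapper_is_smaller = sys.getsizeof(wrapper) < sys.getsizeof(big_string)
print(wrapper_is_smaller)  # True
```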


4. Data classes

Since version 3.7, Python offers data classes. There are several advantages over regular classes or other alternatives like returning multiple values or dictionaries:

  • a data class requires a minimal amount of code
  • you can compare data classes because __eq__ is implemented for you
  • you can easily print a data class for debugging because __repr__ is implemented as well
  • data classes require type hints, reducing the chances of bugs

Here’s an example of a data class at work:

from dataclasses import dataclass

@dataclass
class Card:
    rank: str
    suit: str
    
card = Card("Q", "hearts")

print(card == card)
# True

print(card.rank)
# Q

print(card)
# Card(rank='Q', suit='hearts')

An in-depth guide can be found here.
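The decorator also takes options. The sketch below is a variation on the Card example (with a numeric rank, chosen here only so that ordering is meaningful): order=True generates the comparison methods, frozen=True makes instances read-only, and field(compare=False) leaves a field out of comparisons.

```python
from dataclasses import dataclass, field, FrozenInstanceError


@dataclass(order=True, frozen=True)
class Card:
    rank: int
    # The suit is ignored by ==, <, > and friends.
    suit: str = field(default='hearts', compare=False)


queen = Card(12)
other_queen = Card(12, 'spades')
king = Card(13)

print(queen == other_queen)  # True, two distinct instances with the same rank
print(queen < king)          # True

try:
    queen.rank = 1           # frozen instances cannot be modified
    modified = True
except FrozenInstanceError:
    modified = False
print(modified)              # False
```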


5. The attrs Package

Instead of data classes, you can use attrs. There are two reasons to choose attrs:

  • You are using a Python version older than 3.7
  • You want more features

The attrs package supports all mainstream Python versions, including CPython 2.7 and PyPy. Some of the extras attrs offers over regular data classes are validators and converters. Let’s look at some example code:

from attr import attrs, attrib

@attrs
class Person(object):
    name = attrib(default='John')
    surname = attrib(default='Doe')
    age = attrib(init=False)
    
p = Person()
print(p)
p = Person('Bill', 'Gates')
p.age = 60
print(p)

# Output: 
#   Person(name='John', surname='Doe', age=NOTHING)
#   Person(name='Bill', surname='Gates', age=60)

The authors of attrs have, in fact, worked on the PEP that introduced data classes. Data classes are intentionally kept simpler (easier to understand), while attrs offers the full range of features you might want!

For more examples, check out the attrs examples page.


6. Merging dictionaries (Python 3.5+)

Since Python 3.5, it’s easier to merge dictionaries:

dict1 = { 'a': 1, 'b': 2 }
dict2 = { 'b': 3, 'c': 4 }
merged = { **dict1, **dict2 }
print (merged)
# {'a': 1, 'b': 3, 'c': 4}

If there are overlapping keys, the values from the first dictionary are overwritten by those from the second.

In Python 3.9, merging dictionaries becomes even cleaner. The above merge in Python 3.9 can be rewritten as:

merged = dict1 | dict2
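Because later values win, the order of the operands matters. A typical use is layering overrides on top of defaults, as in this sketch (the settings are invented for illustration):

```python
defaults = {'host': 'localhost', 'port': 8000}
overrides = {'port': 9000, 'debug': True}

# The right-most dictionary wins for overlapping keys.
merged = {**defaults, **overrides}
print(merged)  # {'host': 'localhost', 'port': 9000, 'debug': True}

# Swapping the order lets the defaults win instead.
reversed_merge = {**overrides, **defaults}
print(reversed_merge['port'])  # 8000

# Merging builds a new dictionary; the inputs are left untouched.
print(defaults)  # {'host': 'localhost', 'port': 8000}
```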

7. Find the Most Frequently Occurring Value

To find the most frequently occurring value in a list or string:

test = [1, 2, 3, 4, 2, 2, 3, 1, 4, 4, 4]
print(max(set(test), key = test.count))
# 4

Do you understand why this works? Try to figure it out for yourself before reading on.

You didn’t try, did you? I’ll tell you anyway:

  • max() will return the highest value in a list. The key argument takes a single argument function to customize the sort order, in this case, it’s test.count. The function is applied to each item on the iterable.
  • test.count is a built-in method of list. It takes an argument and counts the number of occurrences of that argument. So test.count(1) will return 2 and test.count(4) returns 4.
  • set(test) returns all the unique values from test, so {1, 2, 3, 4}

So what we do in this single line of code is take all the unique values of test, which is {1, 2, 3, 4}. Next, max applies test.count to each of them and returns the value with the highest count.

And no — I didn’t invent this one-liner.

Update: a number of commenters rightfully pointed out that there’s a much more efficient way to do this:

from collections import Counter
Counter(test).most_common(1)
# [(4, 4)]
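Note that most_common() returns a list of (value, count) tuples, so you usually unpack the first entry. It is also more efficient because the list is scanned once, whereas the max/count one-liner calls test.count for every unique value. A short sketch:

```python
from collections import Counter

test = [1, 2, 3, 4, 2, 2, 3, 1, 4, 4, 4]
counts = Counter(test)

# Unpack the single (value, count) pair.
value, count = counts.most_common(1)[0]
print(value, count)  # 4 4

# Ask for more entries to get a ranking.
top_two = counts.most_common(2)
print(top_two)  # [(4, 4), (2, 3)]

# Counter works on any iterable of hashable items, including strings.
top_letter = Counter('banana').most_common(1)
print(top_letter)  # [('a', 3)]
```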

8. Return Multiple Values

Functions in Python can return more than one variable without the need for a dictionary, a list, or a class. It works like this:

def get_user(user_id):
    # fetch user from database
    # (placeholder values stand in for the real lookup)
    name, birthdate = 'Jane Doe', '1990-01-01'
    return name, birthdate

name, birthdate = get_user(4)

This is alright for a limited number of return values. But anything past 3 values should be put into a (data) class.
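One lightweight way to follow that advice is typing.NamedTuple, which keeps tuple unpacking working while giving every value a name. This is only a sketch: the User fields and the hard-coded values are hypothetical stand-ins for a real database lookup.

```python
from typing import NamedTuple


class User(NamedTuple):
    name: str
    birthdate: str
    email: str
    active: bool


def get_user(user_id):
    # Hypothetical lookup; hard-coded values stand in for a database query.
    return User(name='Jane Doe', birthdate='1990-01-01',
                email='jane@example.com', active=True)


user = get_user(4)
print(user.name)    # Jane Doe
print(user.active)  # True

# Tuple unpacking still works when you only need the first few values.
name, birthdate, *_ = user
print(birthdate)    # 1990-01-01
```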

Python | Read Text from Image with One Line Code

posted Jun 20, 2020, 1:24 PM by Chris G   [ updated Jun 20, 2020, 1:25 PM ]

Dealing with images is not a trivial task. To you, as a human, it’s easy to look at something and immediately know what is it you’re looking at. But computers don’t work that way.

Tasks that are too hard for you, like complex arithmetic and math in general, are something a computer chews through without breaking a sweat. But here the exact opposite applies: tasks that are trivial to you, like recognizing whether an image shows a cat or a dog, are really hard for a computer. In a way, we are a perfect match. For now at least.

While image classification and tasks that involve some level of computer vision might require a good bit of code and a solid understanding, reading text from a somewhat well-formatted image turns out to be a one-liner in Python —and can be applied to so many real-life problems.

And in today’s post, I want to prove that claim. There will be some installation to go through, but it shouldn’t take much time. These are the libraries you’ll need:

  • OpenCV
  • PyTesseract

I don’t want to prolong this intro any further, so why don’t we jump into the good stuff now.

OpenCV

Now, this library will only be used to load the image(s), so you don’t actually need to have a solid understanding of it beforehand (although it might be helpful, you’ll see why).

According to the official documentation:

OpenCV (Open Source Computer Vision Library) is an open source computer vision and machine learning software library. OpenCV was built to provide a common infrastructure for computer vision applications and to accelerate the use of machine perception in the commercial products. Being a BSD-licensed product, OpenCV makes it easy for businesses to utilize and modify the code.[1]

In a nutshell, you can use OpenCV to do any kind of image transformation, and it’s a fairly straightforward library.

If you don’t already have it installed, it’ll be just a single line in terminal:

pip install opencv-python

And that’s pretty much it. It was easy up until this point, but that’s about to change.

PyTesseract

What the heck is this library? Well, according to Wikipedia:

Tesseract is an optical character recognition engine for various operating systems. It is free software, released under the Apache License, Version 2.0, and development has been sponsored by Google since 2006.[2]

I’m sure there are more sophisticated libraries available now, but I’ve found this one working out pretty well. Based on my own experience, this library should be able to read text from any image, provided that the font isn’t some bulls*** that even you aren’t able to read.

If it can’t read from your image, spend more time playing around with OpenCV, applying various filters to make the text stand out.

Now the installation is a bit of a pain in the bottom. If you are on Linux, it all boils down to a couple of sudo apt-get commands:

sudo apt-get update
sudo apt-get install tesseract-ocr
sudo apt-get install libtesseract-dev

I’m on Windows, so the process is a bit more tedious.

First, open up THIS URL, and download 32bit or 64bit installer:

This is image title

The installation by itself is straightforward, boils down to clicking Next a couple of times. And yeah, you also need to do a pip installation:

pip install pytesseract

Is that all? Well, no. You still need to tell Python where Tesseract is installed. On Linux machines, I didn’t have to do so, but it’s required on Windows. By default, it’s installed in Program Files.

If you did everything correctly, executing this cell should not yield any error:

This is image title

Is everything good? You may proceed.

Reading the Text

Let’s start with a simple one. I’ve found a couple of royalty-free images that contain some sort of text, and the first one is this:

Reading the Text

It should be the easy one, and there is a possibility that Tesseract will read those blue ‘objects’ as brackets. Let’s see what will happen:

This is image title

My claim was true. It’s not a problem though, you could easily address those with some Python magic.

The next one could be more tricky:

This is image title

I hope it won’t detect that ‘B’ on the coin:

This is image title

Looks like it works perfectly.

Now it’s up to you to apply this to your own problem. OpenCV skills could be of vital importance here if the text blends with the background.

Before you leave

Reading text from an image is a pretty difficult task for a computer to perform. Think about it: the computer doesn’t know what a letter is; it works only with numbers. What happens under the hood might seem like a black box at first, but I encourage you to investigate further if this is your area of interest.

I’m not saying that PyTesseract will work perfectly every time, but I’ve found it good enough even on some trickier images. But not straight out of the box. Some image manipulation is required to make the text stand out.

It’s a complex topic, I know. Take it one day at a time. One day it will be second nature to you.



https://morioh.com/p/177cde94de0e?f=5c21fb01c16e2556b555ab32&_lrsc=3e30293b-e197-4a5c-b336-addd285eb852


References

[1] https://opencv.org/about/

[2] https://en.wikipedia.org/wiki/Tesseract_(software)
