Usually, aside from making sure that the differences in measurements you observed are statistically significant, you want them to be *practically* significant as well. That means you don’t want to test simply whether your statistics (e.g. averages) differ, but whether they differ by some margin, $\epsilon$. The size of the margin depends on the application – if you want to increase a click-through rate, you probably have a clearly specified business goal, saying something like *Increase CTR by 10 percentage points*.

So how do we test for a margin of difference? We’ll need to run what’s called a *composite hypothesis test*. Deriving it is a fun exercise in the fundamentals of hypothesis testing, so let’s first derive the test for the case of a single sample, comparing its mean to a constant. Then we’ll modify it into a two-sample test, enabling us to compare the averages of two samples.

One-sample test

Let’s assume our data points $X_{1},X_{2},…,X_{n}$ come as iid observations from some unknown distribution, with mean $\mu$ and variance $\sigma^2$. We want to test with the following null and alternative hypotheses

\[ \begin{eqnarray*} H_{0}:\left|\mu-\mu_{0}\right| & \le & \epsilon\\ H_{1}:\left|\mu-\mu_{0}\right| & > & \epsilon \end{eqnarray*} \]where $\epsilon > 0$ is a constant quantifying our desired *practical significance*.

Unrolling this, we get

$H_{0}:(\mu\ge\mu_{0}-\epsilon)\ and\ (\mu\le\mu_{0}+\epsilon)$

$H_{1}:(\mu<\mu_{0}-\epsilon)\ or\ (\mu>\mu_{0}+\epsilon)$

which should make it obvious why it’s called a composite hypothesis test.

As usual, we’ll partition the parameter space $\Theta$ into two subspaces, corresponding to parameter spaces for $H_0$ and $H_1$:

$\Theta_{0}=\left\{ \mu\in\mathbb{R}\mid(\mu\ge\mu_{0}-\epsilon)\ and\ (\mu\le\mu_{0}+\epsilon)\right\} $

$\Theta_{1}=\left\{ \mu\in\mathbb{R}\mid(\mu<\mu_{0}-\epsilon)\ or\ (\mu>\mu_{0}+\epsilon)\right\} $

Since we want to test the mean, we’ll use the sample average as the estimate for $\mu$:

\[\hat{\mu}=\bar{X}_{n}=\frac{1}{n}\sum_{i=1}^{n}X_{i}\]

The Central limit theorem tells us that as $n$ grows, $\sqrt{n}(\bar{X}_{n}-\mu)$ converges in distribution to a normal random variable:

\[\sqrt{n}(\bar{X}_{n}-\mu)\overset{(d)}{\longrightarrow}\mathcal{N}\left(0,\sigma^{2}\right)\ \ as\ n\rightarrow\infty\]

so for large $n$ we can use the approximation:

\[\bar{X}_{n}\overset{approx.}{\sim}\mathcal{N}\left(\mu,\frac{\sigma^{2}}{n}\right)\]

So under $H_0$, $\bar{X}_n$ is approximately distributed as one of the $\mathcal{N}\left(\mu,\frac{\sigma^{2}}{n}\right)$ for $\mu\in\Theta_{0}$.

We want to design a test with significance level $\alpha$, limiting the *Type 1 error*. Let’s consider the following test for some $z \ge \epsilon$:

\[\psi=\begin{cases}1\ (H_{0}\ rejected) & if\ (\hat{\mu}<\mu_{0}-z)\ or\ (\hat{\mu}>\mu_{0}+z)\\0\ (H_{0}\ not\ rejected) & otherwise \end{cases}\]

We call *Type I error* the error of falsely rejecting the null hypothesis, and define the Type I error rate $\alpha_\psi$ as the probability of rejecting the null hypothesis when it is in fact true:

$\alpha_\psi(\mu)=P_\mu(\psi=1),\ \mu\in\Theta_0$

A test is said to have **level** $\alpha$ if

$\alpha_\psi(\mu)=P_\mu(\psi=1)\le\alpha,\ \forall\mu\in\Theta_0$

Thus, the smallest level of our test $\psi$ is

\[ \begin{eqnarray*} \alpha & = & \underset{\mu\in\Theta_{0}}{sup}P_{\mu}\left(\psi=1\right)\\ & = & \underset{\mu\in\Theta_{0}}{sup}P_{\mu}\left((\hat{\mu}<\mu_{0}-z)\ or\ (\hat{\mu}>\mu_{0}+z)\right)\\ & = & \underset{\mu\in\Theta_{0}}{sup}P_{\mu}\left((\hat{\mu}-\mu_{0}<-z)\ or\ (\hat{\mu}-\mu_{0}>z)\right)\\ & \overset{by\ CLT}{\sim} & \underset{\mu\in\Theta_{0}}{sup}\left\{ P\left(\mathcal{N}\left(\mu-\mu_{0},\frac{\sigma^{2}}{n}\right)<-z\right)\ +\ P\left(\mathcal{N}\left(\mu-\mu_{0},\frac{\sigma^{2}}{n}\right)>z\right)\right\} \\ & = & \underset{\mu\in\Theta_{0}}{sup}\left\{ P\left(\mathcal{N}\left(0,1\right)<\sqrt{n}\frac{-z-(\mu-\mu_{0})}{\sigma}\right)\ +\ P\left(\mathcal{N}\left(0,1\right)>\sqrt{n}\frac{z-(\mu-\mu_{0})}{\sigma}\right)\right\} \\ & = & \underset{\mu\in\Theta_{0}}{sup}\left\{ \Phi\left(\sqrt{n}\frac{-z-(\mu-\mu_{0})}{\sigma}\right)+1-\Phi\left(\sqrt{n}\frac{z-(\mu-\mu_{0})}{\sigma}\right)\right\} \\ \end{eqnarray*} \]Where $\Phi$ is the cumulative distribution function of a standard Gaussian.

Let’s see how $\alpha_\psi$ behaves as we move $\mu$ over $\Theta_0$, that is from $\mu_0-\epsilon$ to $\mu_0+\epsilon$. We’ll take the derivative of $\alpha_\psi$ with respect to $\mu$:

\[ \begin{eqnarray*} \frac{\partial}{\partial\mu}\alpha_{\psi}(\mu) & = & \frac{\partial}{\partial\mu}\left(P_{\mu}\left(\psi=1\right)\right)\\ & = & -\phi\left(\sqrt{n}\frac{-z-(\mu-\mu_{0})}{\sigma}\right)\cdot\frac{\sqrt{n}}{\sigma}+\phi\left(\sqrt{n}\frac{z-(\mu-\mu_{0})}{\sigma}\right)\cdot\frac{\sqrt{n}}{\sigma}\\ & = & \frac{\sqrt{n}}{\sigma}\left(\phi\left(\sqrt{n}\frac{z-(\mu-\mu_{0})}{\sigma}\right)-\phi\left(\sqrt{n}\frac{-z-(\mu-\mu_{0})}{\sigma}\right)\right) \end{eqnarray*} \]Where $\phi(x)=\frac{\partial}{\partial{x}}\Phi(x)$ is the pdf of a standard Gaussian.

By symmetry of $\phi$, we can show that $\frac{\partial}{\partial\mu}\alpha_{\psi}(\mu)=0$ when $\mu=\mu_0$, and using properties of the Gaussian pdf that, regardless of $n$ and $\sigma$:

$\frac{\partial}{\partial\mu}\alpha_{\psi}(\mu)<0\ for\ \mu<\mu_0$

$\frac{\partial}{\partial\mu}\alpha_{\psi}(\mu)>0\ for\ \mu>\mu_0$

Concretely, this means that $\alpha_\psi$ is smallest at $\mu=\mu_0$ and grows as we move away from $\mu_0$. It’s easily shown that $\alpha_\psi(\mu_0-\epsilon)=\alpha_\psi(\mu_0+\epsilon)$, i.e. it attains its largest value at the two edges of $\Theta_0$, $\mu=\mu_0\pm\epsilon$. Formally:

\[ \begin{eqnarray*} \underset{\mu\in\Theta_{0}}{argsup}P_{\mu}\left(\psi=1\right) & = & \mu_{0}\pm\epsilon\\ \underset{\mu\in\Theta_{0}}{sup}P_{\mu}\left(\psi=1\right) & = & \Phi\left(\sqrt{n}\frac{-z-\epsilon}{\sigma}\right)+1-\Phi\left(\sqrt{n}\frac{z-\epsilon}{\sigma}\right) \end{eqnarray*} \]So far, we’ve shown that with our test defined as:

\[\psi=\begin{cases}1\ (H_{0}\ rejected) & if\ (\hat{\mu}<\mu_{0}-z)\ or\ (\hat{\mu}>\mu_{0}+z)\\0\ (H_{0}\ not\ rejected) & otherwise \end{cases}\]and for an arbitrary $z\ge\epsilon$, our test has level $\alpha=\Phi\left(\sqrt{n}\frac{-z-\epsilon}{\sigma}\right)+1-\Phi\left(\sqrt{n}\frac{z-\epsilon}{\sigma}\right)$.

Let’s now go the other way and pick $z$ for a desired level $\alpha$. Looking at the equation for $\alpha$ above, it isn’t obvious how to solve for $z$. To avoid solving the equation analytically, we’ll note the following: $\alpha$ is monotonically decreasing in $z$, i.e. as we move $z$ farther above $\epsilon$, $\alpha$ gets smaller. That means we can find $z_\alpha$ numerically with the bisection method.
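Since $\alpha$ is monotonically decreasing in $z$, a dozen lines of Python are enough to find $z_\alpha$. Here’s a sketch (the function names and the bracketing heuristic are mine, not part of the derivation):

```python
from math import erf, sqrt

def norm_cdf(x):
    # Standard normal CDF Phi, via the error function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def level_of(z, eps, sigma, n):
    # alpha as a function of z, from the formula derived above
    return norm_cdf(sqrt(n) * (-z - eps) / sigma) + 1 - norm_cdf(sqrt(n) * (z - eps) / sigma)

def z_for_level(alpha, eps, sigma, n, tol=1e-10):
    # level_of is decreasing in z on [eps, inf), so we can bisect
    lo = eps
    hi = eps + 10 * sigma / sqrt(n)
    while level_of(hi, eps, sigma, n) > alpha:  # grow hi until alpha is bracketed
        hi += 10 * sigma / sqrt(n)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if level_of(mid, eps, sigma, n) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

As a sanity check, with $\epsilon=0$, $\sigma=1$, $n=1$ this recovers the familiar two-sided critical value $z_{0.05}\approx1.96$.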

However, since getting the *p-value* is enough for a test, we don’t actually need to solve for $z_\alpha$. Note that as $z$ increases, $\alpha$ decreases, and from the definition of our test, the largest $z$ at which we’ll still reject is $z=\left|\hat{\mu}-\mu_{0}\right|$. This means that the smallest level at which we can reject, the p-value, is given by:

\[\text{p-value}=\Phi\left(\sqrt{n}\frac{-\left|\hat{\mu}-\mu_{0}\right|-\epsilon}{\sigma}\right)+1-\Phi\left(\sqrt{n}\frac{\left|\hat{\mu}-\mu_{0}\right|-\epsilon}{\sigma}\right)\]

As a special case, consider the test with $\epsilon=0$. We’ll have:

\[ \begin{eqnarray*} H_{0}:\left|\mu-\mu_{0}\right| & = & 0\Leftrightarrow\mu=\mu_{0}\\ H_{1}:\left|\mu-\mu_{0}\right| & > & 0\Leftrightarrow\mu\ne\mu_{0} \end{eqnarray*} \]and:

\[ \begin{eqnarray*} \text{p-value} & = & \Phi\left(\sqrt{n}\frac{-\left|\hat{\mu}-\mu_{0}\right|}{\sigma}\right)+1-\Phi\left(\sqrt{n}\frac{\left|\hat{\mu}-\mu_{0}\right|}{\sigma}\right)\\ & = & 2\cdot\Phi\left(\sqrt{n}\frac{-\left|\hat{\mu}-\mu_{0}\right|}{\sigma}\right) \end{eqnarray*} \]which is exactly the familiar simple-hypothesis two-sided test.

Two-sample test

So far we’ve just derived a one-sample test, so we need to modify it a bit to test the difference between means of two samples.

Given $n$ observations $X_1…X_n$ from a distribution with mean $\mu_X$ and variance $\sigma_X^2$, and $m$ observations $Y_1…Y_m$ from a distribution with mean $\mu_Y$ and variance $\sigma_Y^2$ , we formulate the following null and alternative hypotheses:

\[ \begin{eqnarray*} H_{0} & : & \left|\mu_{X}-\mu_{Y}\right|\le\epsilon\\ H_{1} & : & \left|\mu_{X}-\mu_{Y}\right|>\epsilon \end{eqnarray*} \]This says that $\mu_X$ differs from $\mu_Y$ by at least a margin of $\epsilon$. As before, $\epsilon$ states our desired *practical significance*.

Let’s define $d=\mu_X-\mu_Y$. Then we can rewrite our hypotheses as:

\[ \begin{eqnarray*} H_{0} & : & \left|d-0\right|\le\epsilon\\ H_{1} & : & \left|d-0\right|>\epsilon \end{eqnarray*} \]We’ll use $\hat{d}=\hat{\mu}_{X}-\hat{\mu}_{Y}=\bar{X}_{n}-\bar{Y}_{m}$ as the estimator for $d$. From the Central limit theorem, and the independence of the two samples, we get that for large $n$ and $m$:

\[ \hat{d}=\bar{X}_{n}-\bar{Y}_{m}\overset{approx.}{\sim}\mathcal{N}\left(\mu_{X}-\mu_{Y},\frac{\sigma_{X}^{2}}{n}+\frac{\sigma_{Y}^{2}}{m}\right) \]Substituting $\hat{\mu}\rightarrow\hat{d}$, $\mu_0\rightarrow0$, $\frac{\sigma^{2}}{n}\rightarrow\frac{\sigma_{X}^{2}}{n}+\frac{\sigma_{Y}^{2}}{m}$ and plugging into our formula for the p-value, we get:

\[ \text{p-value}=\Phi\left(\frac{-\left|\hat{d}\right|-\epsilon}{\sqrt{\frac{\sigma_{X}^{2}}{n}+\frac{\sigma_{Y}^{2}}{m}}}\right)+1-\Phi\left(\frac{\left|\hat{d}\right|-\epsilon}{\sqrt{\frac{\sigma_{X}^{2}}{n}+\frac{\sigma_{Y}^{2}}{m}}}\right) \]Note once again that setting $\epsilon=0$ we get the familiar form for a simple two-sample two-sided test for difference of means.
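With plug-in estimates for the two variances, the whole two-sample test fits in a few lines of Python. A sketch (function and variable names are mine):

```python
from math import erf, sqrt

def norm_cdf(x):
    # Standard normal CDF Phi, via the error function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def margin_pvalue(mean_x, mean_y, var_x, var_y, n, m, eps):
    # p-value for H0: |mu_X - mu_Y| <= eps  vs  H1: |mu_X - mu_Y| > eps
    d_hat = mean_x - mean_y
    se = sqrt(var_x / n + var_y / m)  # std. error of the difference of averages
    return norm_cdf((-abs(d_hat) - eps) / se) + 1 - norm_cdf((abs(d_hat) - eps) / se)
```

Setting `eps=0` recovers the ordinary two-sided two-sample Z-test, and increasing `eps` makes the p-value larger, as expected: it’s harder to demonstrate a difference of at least $\epsilon$ than any difference at all.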

Conclusion

We’ve shown a step-by-step derivation of a two-sample two-sided Z-test with a margin of tolerance. Deriving a one-sided test should be even simpler, as it’s just a modification of a simple-hypothesis one-sided test with an extended range for $\Theta_0$.

Using the formulas above, when we reject $H_0$ we can also find the largest $\epsilon$ at which we can still reject at a given p-value. This gives us a lower bound on the difference between the means at a given statistical significance. Note, though, that this calls for the same caution as avoiding p-value hacking.

Recently I wanted to try out the Google Natural Language API for sentiment analysis and run it on responses to Medium articles. *Responses* are what would be called comments elsewhere, but you’ll see that they’re actually the same thing as full blown posts when it comes to fetching them via the JSON API. To avoid confusion with HTTP and JSON responses, I’ll just call them *comments* here.

Now, all this is a bit of a hack – there’s no proper official Medium API to do this – but at the moment it works like a charm. The trick: to get a JSON version of any article, just append `?format=json` to that article’s URL. For example, try this link.

Note that the JSON response starts with `])}while(1);</x>`, which is a way of preventing JSON hijacking, so if you’re fetching the data using the Python `requests` library or similar, you’ll have to remove this prefix before parsing the JSON.

Anyways, when you inspect the JSON of an article (Postman is a great tool to do it), you’ll note that you don’t receive any comments data in it.

Using the dev tools in your browser of choice to inspect network requests, you can find out how Medium fetches comments when displaying them. Open a Medium post that has some comments, scroll to the bottom, and note the network activity when you click *Show all responses*. A request is sent to

https://medium.com/_/api/posts/45777098038c/responsesStream?filter=other

The number in the middle is the *post id*. Dropping `?filter=other` seems to return all the comments. I haven’t tested whether there’s any paging for a huge number of comments, but it works fine for posts with dozens of them.

The response looks something like this:

```json
{
  "success": true,
  "payload": {
    "streamItems": [
      {
        "createdAt": 1562935113319,
        "postPreview": {
          "postId": "d11246b4b1f2"
        },
        "randomId": "19028b3d16f3",
        "itemType": "postPreview",
        "type": "StreamItem"
      },
      {
        "createdAt": 1562935113319,
        "postPreview": {
          "postId": "37b92608ffd8"
        },
        "randomId": "21c8cb92af8a",
        "itemType": "postPreview",
        "type": "StreamItem"
      },
      {
        "createdAt": 1562935113319,
        "postPreview": {
          "postId": "4eea41e8e379"
        },
        "randomId": "63ad9117c6bc",
        "itemType": "postPreview",
        "type": "StreamItem"
      },
      ....
```

It’s just a list of ids for article comments. Remember that Medium comments are structured as full blown articles, so we’ll just need to fetch each of them separately, by id.

Given an article URL, here’s what we’ll do:

1. Fetch the article JSON
2. Get the article ID from the JSON
3. Fetch the comments list
4. Fetch each comment from the list

Here’s a code sample that does just that:
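The snippet below is a sketch reconstructed from the steps above, not the post’s original code. The `responsesStream` endpoint and the `?format=json` trick come from the post; the exact JSON field paths and the per-comment `https://medium.com/p/<id>?format=json` URL are assumptions you may need to adjust after inspecting the payloads yourself.

```python
import json
from urllib.request import urlopen

PREFIX = "])}while(1);</x>"  # Medium's JSON-hijacking guard

def parse_medium_json(raw):
    # Strip the anti-hijacking prefix, then parse the actual JSON payload
    if raw.startswith(PREFIX):
        raw = raw[len(PREFIX):]
    return json.loads(raw)

def get_json(url):
    with urlopen(url) as resp:
        return parse_medium_json(resp.read().decode("utf-8"))

def fetch_comments(article_url):
    # 1. + 2. Fetch the article JSON and pull out the post id
    #    (the payload.value.id path is an assumption -- inspect your payload)
    article = get_json(article_url + "?format=json")
    post_id = article["payload"]["value"]["id"]
    print("Post id:", post_id)
    # 3. Fetch the list of comment ids from the responsesStream endpoint
    stream = get_json("https://medium.com/_/api/posts/%s/responsesStream" % post_id)
    comment_ids = [item["postPreview"]["postId"]
                   for item in stream["payload"]["streamItems"]
                   if item["itemType"] == "postPreview"]
    # 4. Comments are full-blown posts, so fetch each one with the same JSON trick
    #    (the /p/<id> URL and the paragraphs path are assumptions too)
    for cid in comment_ids:
        comment = get_json("https://medium.com/p/%s?format=json" % cid)
        paragraphs = comment["payload"]["value"]["content"]["bodyModel"]["paragraphs"]
        print((cid, [p["text"] for p in paragraphs]))

# fetch_comments("https://medium.com/some-publication/some-article-45777098038c")
```

Point the (commented-out) call at a real article URL to run it.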

Run it and you should get output looking like this:

Post id: 45777098038c

('d11246b4b1f2', ['Old saying: Leica makes the best lenses, Canon makes the best bodies, Nikon makes the best compromises.'])

('37b92608ffd8', ['Canon may have a higher market share, but they do not make better DSLRs. Best cropped sensor: Nikon D500. Best all around: Nikon D850. Best sports/wildlife: Nikon D5, though Canon 1DX Mk II is a very good flagship camera. Nikon’s AF is superior. Ask Melissa Groo, longtime pro Canon shooter who is seriously considering switching to Nikon after experiencing the AF on the D850.'])

('4eea41e8e379', ['Well for your information, The photolithography fabs at Intel are run by Nikon. Their machines are vital for every chip! So, when you see those tiny little chips in your iPhone, your computers and tablets, you can thank Nikon and the photolithography Nikon employees who run those machines and the people who fix them.'])

('80db3c28c10a', ['I’d heard this story before, but never told in such detail, or with such awesome accompanying visuals. Thanks for sharing it!'])

...

That’s really all there is to it. Whether you want to do NLP on Medium articles or use the data for something else, this should help you get started (and make sure you comply with Medium’s terms of service).

A thing that has always annoyed me is returning to the top-level directory of a git repo from somewhere deeper within.

It turns out that there’s a really simple solution – just make a bash alias `gitroot` as follows:

`alias gitroot='cd $(git rev-parse --show-toplevel)'`

Put that line in `~/.bashrc` (or `~/.zshrc` if you use zsh) and you’re all set. Whenever you’re inside a git repo, typing `gitroot` will take you to its root dir.

The idea that I went with was to make two bash scripts:

- `savepass.sh`, used to generate a random password and save it in an encrypted file
- `getpass.sh`, used to retrieve the password from the encrypted file and pipe it to `pbcopy` (I use Mac OS, but there are Linux alternatives to `pbcopy`)

It’s not a particularly original idea and you can probably find more robust implementations somewhere on GitHub.

The first step is to generate a **gpg** key. Here’s a nice tutorial on gpg that should be enough to understand what’s going on here.

Once we’ve generated our key, we can access it from the gpg keychain, and it’s not a bad idea to export the private key and create a backup somewhere, since you can’t recover your passwords without it.

To list the available keys use `gpg --list-secret-keys`, which gives you something like this:

```
------------------------
sec   2048R/ABCD10BF 2019-07-09
uid                  sh_pass_manager (key for sh_pass_manager) <sh_pass_manager@sh_pass_manager>
ssb   2048R/BAC4160B 2019-07-09
```

Now we can create `savepass.sh`, which will take the service name (e.g. “gmail.com”) as a parameter and save the encrypted random password in a file:

```bash
#!/bin/bash
SERVICE_NAME=$1
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
mkdir -p "$SCRIPT_DIR/pass"
GPG_RECIPIENT="sh_pass_manager (key for sh_pass_manager) <sh_pass_manager@sh_pass_manager>"
openssl rand -base64 32 | gpg --encrypt --recipient "$GPG_RECIPIENT" > "$SCRIPT_DIR/pass/$SERVICE_NAME"
```

Run this with `./savepass.sh gmail.com` and you should get a file `pass/gmail.com`. Open it with `less` to check that it’s a garbled binary file.

Note that we used `openssl rand -base64 32` to generate a random password, which is nothing but 32 random bytes encoded in `base64` (see the docs).

We can now write `getpass.sh` to retrieve our encrypted passwords:

```bash
#!/bin/bash
set -e
set -o pipefail
PASS_FILE=$1
gpg --decrypt "$PASS_FILE" | pbcopy
echo "Password copied to clipboard"
```

It’s as simple as calling `gpg --decrypt` on an encrypted file.

Now, if all this is done inside a git repo, you can back up your encrypted passwords by pushing them to a git remote. (You can also do it automatically upon password creation, at the end of `savepass.sh`.)

You can clone the two files from https://github.com/drazenz/sh-pass-manager.

**Disclaimer: I am by no means a computer security expert, and the simple hack above likely has some security flaws. It’s aimed more as an illustration of a simple unix command line workflow, rather than a properly secured password manager.**


You already know that if you have a data set with many columns, a good way to quickly check correlations among columns is by visualizing the correlation matrix as a heatmap.

But is a simple heatmap the best way to do it?

For illustration, I’ll use the Automobile Data Set, containing various characteristics of a number of cars. You can also find a clean version of the data with header columns here.

Let’s start by making a correlation matrix heatmap for the data set.

Great! Green means positive, red means negative. The stronger the color, the larger the correlation magnitude. Now looking at the chart above, think about the following questions:

- Where do your eyes jump first when you look at the chart?
- What’s the strongest and what’s the weakest correlated pair (except the main diagonal)?
- What are the three variables most correlated with *price*?

If you’re like most people, you’ll find it hard to map the color scale to numbers and vice versa.

Distinguishing positive from negative is easy, as is 0 from 1. But what about the second question? Finding the strongest negative and positive correlations means finding the strongest red and green. To do that, I need to carefully scan the entire grid. Try to answer it again and notice how your eyes jump around the plot, sometimes going to the legend.

Now consider the following plot:

In addition to color, we’ve added size as a parameter to our heatmap. The size of each square corresponds to the magnitude of the correlation it represents, that is

**size(c1, c2) ~ abs(corr(c1, c2))**

Now try to answer the questions using the latter plot. Notice how weak correlations visually disappear, and your eyes are immediately drawn to areas where there’s high correlation. Also note that it’s now easier to compare magnitudes of negative vs positive values (lighter red vs lighter green), and we can also compare values that are further apart.

If we’re mapping magnitudes, it’s much more natural to link them to the size of the representing object than to its color. That’s exactly why on bar charts you would use height to display measures, and colors to display categories, but not vice versa.

Discrete Joint Distributions

Let’s see how the cars in our data set are distributed according to horsepower and drivetrain layout. That is, we want to visualize the following table

| horsepower ↓ \ drive-wheels → | 4wd | fwd | rwd |
|---|---|---|---|
| Low (0–100) | 5 | 89 | 15 |
| Medium (100–150) | 3 | 24 | 35 |
| High (150+) | 0 | 5 | 25 |

Consider the following two ways to do it

The second version, where we use square size to display counts, makes it effortless to determine which group is the largest/smallest. It also gives some intuition about the marginal distributions, all without needing to refer to a color legend.

Great. So how do I make these plots?

To make a regular heatmap, we simply used the Seaborn *heatmap* function, with a bit of additional styling.
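For reference, here’s a minimal sketch of that first version. The synthetic data stands in for the Automobile Data Set (with the real data you’d `pd.read_csv` the file and call `df.corr()`), and the styling details are my guess at the post’s original snippet:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for the Automobile Data Set
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)),
                  columns=["price", "horsepower", "engine-size", "city-mpg"])
corr = df.corr()

plt.figure(figsize=(6, 5))
sns.heatmap(corr,
            cmap=sns.diverging_palette(20, 220, n=256, as_cmap=True),
            vmin=-1, vmax=1,          # pin the scale to the full correlation range
            square=True, linewidths=0.5)
plt.tight_layout()
plt.savefig("corr_heatmap.png")
```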

For the second kind, there’s no trivial way to make it using matplotlib or seaborn. We could use *corrplot* from biokit, but it helps with correlations only and isn’t very useful for two-dimensional distributions.

Building a robust parametrized function that enables us to make heatmaps with sized markers is a nice exercise in matplotlib, so I’ll show you how to do it step by step.

We’ll start by using a simple scatter plot with squares as markers. Then we’ll fix some issues with it, add color and size as parameters, make it more general and robust to various types of input, and finally make a wrapper function *corrplot* that takes a result of *DataFrame.corr* method and plots a correlation matrix, supplying all the necessary parameters to the more general *heatmap* function.

It’s just a scatter plot

If we want to plot elements on a grid made by two categorical axes, we can use a scatter plot.

Looks like we’re onto something. But I said it’s just a scatterplot, and there’s quite a lot happening in the previous code snippet.

Since the scatterplot requires *x* and *y* to be numeric arrays, we need to map our column names to numbers. And since we want our axis ticks to show column names instead of those numbers, we need to set custom *ticks* and *ticklabels*. Finally, there’s code that loads the dataset, selects a subset of columns, calculates all the correlations, *melts* the data frame (the inverse of creating a pivot table) and feeds its columns to our *heatmap* function.

You noticed that our squares are placed where our gridlines intersect, instead of being centered in their cells. In order to move the squares to cell centers, we’ll actually move the grid. And to move the grid, we’ll actually turn off *major* gridlines, and set *minor* gridlines to go right in between our axis ticks.

That’s better. But now the left and bottom sides look cropped. That’s because our axis lower limits are set to 0. We’ll sort this out by setting the lower limit for both axes to −0.5. Remember, our points are displayed at integer coordinates, so our gridlines are at .5 coordinates.
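Putting the pieces so far together, here’s a minimal self-contained sketch (the toy correlation matrix and all names are mine):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

columns = ["a", "b", "c"]
corr = np.array([[1.0, 0.3, -0.6],
                 [0.3, 1.0, 0.1],
                 [-0.6, 0.1, 1.0]])

# Melt the matrix into x, y, value triples
x, y, values = [], [], []
for i in range(len(columns)):
    for j in range(len(columns)):
        x.append(i)
        y.append(j)
        values.append(corr[i, j])

fig, ax = plt.subplots()
# Square markers, sized by correlation magnitude
ax.scatter(x, y, s=[abs(v) * 500 for v in values], marker="s")
# Show column names instead of the integer coordinates
ax.set_xticks(range(len(columns)))
ax.set_xticklabels(columns)
ax.set_yticks(range(len(columns)))
ax.set_yticklabels(columns)
# Turn off major gridlines, put minor gridlines between the ticks...
ax.grid(False, "major")
ax.grid(True, "minor")
ax.set_xticks([t + 0.5 for t in range(len(columns) - 1)], minor=True)
ax.set_yticks([t + 0.5 for t in range(len(columns) - 1)], minor=True)
# ...and pull the lower limits back to -0.5 so the edge squares aren't cropped
ax.set_xlim(-0.5, len(columns) - 0.5)
ax.set_ylim(-0.5, len(columns) - 0.5)
fig.savefig("sized_heatmap.png")
```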

Give it some color

Now comes the fun part. We need to map the possible range of values for correlation coefficients, *[-1, 1]*, to a color palette. We’ll use a *diverging* palette, going from red for -1, all the way to green for 1. Looking at Seaborn color palettes, seems that we’ll do just fine with something like

`sns.palplot(sns.diverging_palette(220, 20, n=7))`

But let’s first flip the order of colors and make the palette smoother by adding more steps between red and green:

`palette = sns.diverging_palette(20, 220, n=256)`

Seaborn color palettes are just arrays of color components, so in order to map a correlation value to the appropriate color, we ultimately need to map it to an index in the palette array. It’s a simple mapping of one interval to another: [-1, 1] → [0, 1] → {0, …, 255}:

*v* ∈ [*val_min*, *val_max*]

↓ *t* = (*v* − *val_min*) / (*val_max* − *val_min*), so *t* ∈ [0.0, 1.0]

↓ *ind* = *round*(*t* · 255), so *ind* ∈ {0, 1, 2, …, 254, 255}
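As a sketch, the mapping is a tiny function (the name and the clamping of out-of-range values are mine):

```python
def value_to_palette_index(v, val_min=-1.0, val_max=1.0, n_colors=256):
    # [val_min, val_max] -> [0, 1] -> {0, ..., n_colors - 1}
    v = max(val_min, min(val_max, v))  # clamp out-of-range values
    t = (v - val_min) / (val_max - val_min)
    return int(round(t * (n_colors - 1)))
```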

Just what we wanted. Let’s now add a color bar on the right side of the chart. We’ll use GridSpec to set up a plot grid with 1 row and *n* columns. Then we’ll use the rightmost column of the plot to display the color bar and the rest to display the heatmap.

There are multiple ways to display a color bar, here we’ll trick our eyes by using a really dense bar chart. We’ll draw *n_colors* horizontal bars, each colored with its respective color from the palette.
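Here’s a sketch of that trick with a toy red-to-green palette (in the post, the palette would come from `sns.diverging_palette(20, 220, n=256)`):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

n_colors = 256
# Toy red-to-green palette as (r, g, b) tuples
palette = [(1 - t, t, 0.2) for t in np.linspace(0, 1, n_colors)]

fig, ax = plt.subplots(figsize=(0.6, 4))
bar_y = np.linspace(-1, 1, n_colors)   # one thin bar per palette entry
bar_height = bar_y[1] - bar_y[0]
ax.barh(y=bar_y, width=1, height=bar_height, color=palette, linewidth=0)
ax.set_xlim(0, 1)
ax.set_xticks([])                      # hide the dummy x axis
ax.yaxis.tick_right()                  # value labels on the right
fig.savefig("colorbar.png", bbox_inches="tight")
```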

And we have our color bar.

We’re almost done. Now we should just flip the vertical axis so that the correlation of each variable with itself shows on the main diagonal, make the squares a bit larger, and make the background just a tad lighter so that values around 0 are more visible.

But let’s first make the entire code more useful.

More parameters!

It would be great if we made our function able to accept more than just a correlation matrix. To do this we’ll make the following changes:

- Be able to pass *color_min, color_max* and *size_min, size_max* as parameters, so that we can map ranges other than [-1, 1] to color and size. This will enable us to use the heatmap beyond correlations
- Use a sequential palette if no palette is specified, and a single color if no *color* vector is provided
- Use a constant size if no *size* vector is provided. Avoid mapping the lowest value to 0 size
- Make *x* and *y* the only required parameters, and pass *size, color, size_scale, size_range, color_range, palette, marker* as kwargs. Provide sensible defaults for each of the parameters
- Use list comprehensions instead of pandas *apply* and *map* methods, so we can pass any kind of arrays as *x, y, color, size*, not just *pandas.Series*
- Pass any other kwargs to the *pyplot.scatter* function
- Make a wrapper function *corrplot* that accepts a *corr()* dataframe, *melts* it, and calls *heatmap* with a red-green diverging color palette and size/color min-max set to [-1, 1]

That’s quite a lot of boilerplate stuff to cover step by step, so here’s what it looks like when done. You can also check it out in this Kaggle kernel.

Now that we have our *corrplot* and *heatmap* functions, in order to create the correlation plot with sized squares, like the one at the beginning of this post, we simply do the following:

And just for fun, let’s make a plot showing how engine power is distributed among car brands in our data set.

I’ve heard about James Clear before, in relation to his 2018 book Atomic Habits, and the name kind of stuck. I haven’t looked much into his work and haven’t yet read the book. To be honest, I just assumed it’s another business book – an idea inflated to fit a 200-page book because you can’t sell a printed blog post.

But listening to Clear talk about the structure of our habits, how to hack our behavior to form useful habits, and what role these habits can play in success, I heard more than a few truly insightful ideas. So I might give the book an honest chance after all.

So, the podcast episode is titled *Building the Habits Necessary to Succeed as a Founder with James Clear*, but Clear’s observations are general, and applicable to any sort of endeavor or career.

I’ll outline the key points that struck me as particularly insightful.

Clear structures his thinking about habits with a 4-stage model:

*Cue → Craving → Response → Reward*

So the *Cue* is an event that you witness occurring. Generally, cues are objective, but two people will react differently to the same cue. The *Craving* depends on who you are – it’s essentially your personal interpretation of a *Cue*. You’ll respond to this *Craving* with an action, thus the *Response* step. The response will give you a *Reward*, which will reinforce your *Response* to the same *Cue* in the future.

Now, you can think of all your actions as responses to cues and cravings, so if you’re getting a negative reward (i.e. punishment) for an action, you’ll hardly make a habit out of it. Simply put, you can’t make a habit of things you don’t consider rewarding.

If you want to start any kind of endeavor, you need to see the reward for it as soon as possible and as frequently as possible. Look for any kind of positive feedback, that’s either your end goal or strongly correlated with your end goal.

If you want to get fit by exercising, you should go to the gym for the feeling of health it gives you after every session, and physical fitness will come as a byproduct. If you’re checking the scale, you’re likely to give up, as you won’t see immediate results.

In general, this explains the *“enjoy the process”* maxim, often seen in relation to fitness programs or any endeavor where the reward comes only after a long time investment. If you can enjoy the process without looking at the seemingly unattainable goal every other moment, you’re more likely to endure.

Clear talks about his writing habit, and points out that with his current audience size, the feedback he gets for his writing is immediate. Even without it, when he’s writing a book, he’ll frequently seek feedback from friends, just so he can keep the behavior reinforced and rewarding.

But aside from an action being rewarding, you’ll want the cues to be clear and obvious, and the response to be feasible and easy to do. For example, you may want to write early in the morning, when you still haven’t started your messy day and interruptions are unlikely.

Now, interestingly, trying to summarize this here made me curious about the book, where I guess Clear expands on all the strategies related to each step. (Edit: found this excerpt from the book on Clear’s blog https://jamesclear.com/three-steps-habit-change)

When starting out with writing, if you’re not sure how to narrow down your topics, just explore whatever you find interesting and see what sticks, both with you and your audience. Then double down on those areas.

An interesting point Clear makes: his website being JamesClear.com instead of GreatHabits.com gives him freedom to explore topics unrelated to a predetermined niche. The downside is that a personal site is slightly harder to turn into a brand.

To quote Clear on this:

It’s easy to get focused on stuff that makes like the last five percent of difference. So people want to get in shape. They are like, all right, what running shoes do I need to buy, or what knee sleeves should I get, or which protein powder is the best. But all that stuff makes like the last two percent of difference, versus the thing that makes 98% of the difference.

Pair this with the Pareto principle – 20% of the effort yields 80% of the effect. Done right, focusing on the right thing at the right time can get you leaps forward.

]]>