Medium is a treasure trove if you want to practice NLP or get data to train your models for other purposes. Luckily, it’s possible to fetch JSON data for any Medium post, together with its metadata about author, claps, tags, etc.
Recently I wanted to try out the Google Natural Language API for sentiment analysis and run it on responses to Medium articles. Responses are what would be called comments elsewhere, but you’ll see that they’re actually the same thing as full blown posts when it comes to fetching them via the JSON API. To avoid confusion with HTTP and JSON responses, I’ll just call them comments here.
Now, all this is a bit of a hack, and there’s no proper official Medium API to do this, but at the moment it works like a charm. The trick: to get the JSON of any article, just append
to that article’s url. For example try this link
Note that the JSON response starts with ])}while(1);</x> which is a way of preventing JSON hijacking, so if you’re fetching the data using the Python requests library or similar, you’ll have to remove this prefix before parsing the JSON.
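As a rough sketch, stripping that prefix before parsing could look like the snippet below. The helper name parse_medium_json and the sample string (including its payload.value.id shape) are assumptions made up for illustration, not a captured Medium response.

```python
import json

# Anti-hijacking prefix, copied from the text above.
HIJACK_PREFIX = "])}while(1);</x>"


def parse_medium_json(raw_text):
    """Strip the anti-hijacking prefix, if present, and parse the rest as JSON."""
    text = raw_text.lstrip()
    if text.startswith(HIJACK_PREFIX):
        text = text[len(HIJACK_PREFIX):]
    return json.loads(text)


# Hypothetical sample body; a real response is much larger.
sample = '])}while(1);</x>{"payload": {"value": {"id": "45777098038c"}}}'
parsed = parse_medium_json(sample)
print(parsed["payload"]["value"]["id"])
```

With the requests library you would pass response.text to the helper instead of calling response.json(), since the prefix makes the raw body invalid JSON.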
Anyways, when you inspect the JSON of an article (Postman is a great tool to do it), you’ll note that you don’t receive any comments data in it.
Using dev tools in your browser of choice to inspect network requests, you can find out how Medium fetches comments when displaying them. Open a Medium post that has some comments, scroll to the bottom, and note the network activity when you click Show all responses. A request is sent to
The number in the middle is the post id. Removing ?filter=other makes sense to get all the comments. I haven’t tested, though, whether there’s any paging for a huge number of comments, but it seems to work for posts with dozens of comments.
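If you copy the request URL straight from dev tools, one way to drop the filter parameter is with the standard library, as sketched below. The example.com URL is a placeholder and not the real Medium endpoint, and drop_query_param is a hypothetical helper name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_query_param(url, name):
    """Return url with every occurrence of the query parameter `name` removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != name
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Placeholder URL, only to show the shape of the transformation.
copied = "https://example.com/comments/45777098038c?filter=other"
print(drop_query_param(copied, "filter"))
```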
The response looks something like this:
It’s just a list of ids for article comments. Remember that Medium comments are structured as full blown articles, so we’ll just need to fetch each of them separately, by id.
Given an article url, here’s what we’ll do:
> Fetch article JSON
> Get article ID from JSON
> Fetch comments list
> Fetch each comment from the list
Here’s a code sample that does just that:
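What follows is a minimal sketch of those four steps rather than a definitive implementation. Several details are assumptions: that appending format=json to a post URL returns its JSON, that the post id lives at payload.value.id, and that the text sits in payload.value.content.bodyModel.paragraphs. The comments endpoint, the per-comment URL, and the way ids are pulled out of the comments list are left as caller-supplied functions (list_url_for, comment_url_for, ids_from, all hypothetical names), so you can plug in whatever you observe in dev tools. It uses urllib from the standard library; requests would work just as well.

```python
import json
import urllib.request

# Anti-hijacking prefix described earlier in the post.
HIJACK_PREFIX = "])}while(1);</x>"


def http_get(url):
    """Default fetcher: return the response body as text."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def load_json(url, fetch):
    """Fetch url, strip the prefix if present, and parse the JSON."""
    text = fetch(url).lstrip()
    if text.startswith(HIJACK_PREFIX):
        text = text[len(HIJACK_PREFIX):]
    return json.loads(text)


def as_json_url(url):
    # ASSUMPTION: a format=json query parameter makes Medium return JSON.
    separator = "&" if "?" in url else "?"
    return url + separator + "format=json"


def paragraph_texts(post):
    # ASSUMPTION: paragraphs live at payload.value.content.bodyModel.paragraphs.
    body = post["payload"]["value"]["content"]["bodyModel"]
    return [paragraph["text"] for paragraph in body["paragraphs"]]


def fetch_comments(article_url, list_url_for, comment_url_for, ids_from,
                   fetch=http_get):
    """Run the four steps and return (post_id, [(comment_id, [texts]), ...])."""
    # Steps 1 and 2: fetch the article JSON and read its id (assumed location).
    article = load_json(as_json_url(article_url), fetch)
    post_id = article["payload"]["value"]["id"]
    # Step 3: fetch the list of comments for that post.
    listing = load_json(list_url_for(post_id), fetch)
    # Step 4: fetch each comment separately, since each one is a full post.
    comments = []
    for comment_id in ids_from(listing):
        comment = load_json(as_json_url(comment_url_for(comment_id)), fetch)
        comments.append((comment_id, paragraph_texts(comment)))
    return post_id, comments
```

Call fetch_comments with the article URL and your three functions, then print the post id followed by each (id, paragraphs) tuple. Passing fetch as a parameter also lets you test the flow against canned responses without touching the network.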
Run it and you should get output looking like this:
Post id: 45777098038c
('d11246b4b1f2', ['Old saying: Leica makes the best lenses, Canon makes the best bodies, Nikon makes the best compromises.'])
('37b92608ffd8', ['Canon may have a higher market share, but they do not make better DSLRs. Best cropped sensor: Nikon D500. Best all around: Nikon D850. Best sports/wildlife: Nikon D5, though Canon 1DX Mk II is a very good flagship camera. Nikon’s AF is superior. Ask Melissa Groo, longtime pro Canon shooter who is seriously considering switching to Nikon after experiencing the AF on the D850.'])
('4eea41e8e379', ['Well for your information, The photolithography fabs at Intel are run by Nikon. Their machines are vital for every chip! So, when you see those tiny little chips in your iPhone, your computers and tablets, you can thank Nikon and the photolithography Nikon employees who run those machines and the people who fix them.'])
('80db3c28c10a', ['I’d heard this story before, but never told in such detail, or with such awesome accompanying visuals. Thanks for sharing it!'])
That’s really all there is to it. Whether you want to do NLP on Medium articles or use the data for something else, this should help you get started (and make sure you comply with Medium’s terms of service).