title: @aniketapanjwani: One of the easiest and most useful tasks to which to put Claude Code is scraping...
author: aniketapanjwani
content_type: twitter_article
published: 2025-11-03T18:45:33+00:00
source_url: https://x.com/aniketapanjwani/status/2029304239345017050
word_count: 1500
One of the easiest and most useful tasks to which to put Claude Code is scraping data.
However, getting optimal data scraping results from Claude Code depend on giving it the right nudges and access to the correct tools.
In this article, I'm going to cover exhaustively nine different ways to scrape data with Claude Code.
Video version of this Article
Everything I discuss in this article is also available in the following YouTube video: https://youtu.be/4jQJdCfjPcw?si=Iv0F54vCa-JrdEUK
In the video, I do live demos walking through each of the following nine ways of scraping data with Claude Code so that you can see directly how they work.
Way 1: Just ask Claude Code to Scrape the Site
For a large set of sites, you can just tell Claude Code to scrape the site, tell it what you want scraped, and ask it to write it out to a CSV or a SQLite file.
It will poke around the site for you, probably write a Python script, run the script, maybe even write some unit tests, and then just write out the data somewhere on your computer.
Way 2: Ask Claude Code to Find Endpoints
A lot of interesting data is not rendered as a static page, but is being loaded dynamically via some API call. Sometimes Claude Code will reverse engineer that API call itself, but sometimes you have to nudge it along and tell it explicitly, "Hey, look for an API which, for example, is showing this hotel pricing and booking data, which you may want for a research paper or for competitor analysis."
The only difference between the previous method is that in this method you ask it to look for endpoints. Just giving that word, that little nudge, will sometimes get you better results than if you just ask it to scrape the site.
Way 3: ScrapeCreators
A lot of useful data on most social media sites is scrapeable, but they make their endpoints particularly difficult to reverse engineer.
They have their own anti-bot rules, and they change the way selectors work each week. You could have Claude Code or Codex continually trying to reverse engineer them, but what I like doing is using a tool called Scrape Creators ( https://scrapecreators.com/ ).
It has API endpoints for pretty much every social media API. I would recommend creating a skill for the Scrape Creators endpoints as a one-time utility that then your agentic coding tool will always have access to.
Way 4: Apify Actor
Apify is a marketplace of scrapers. For a lot of difficult-to-scrape websites, people have made rentable scrapers that are available on Apify (called "actors").
One scraper available there that I like to use is the Google Maps scraper, which can be quite useful for social scientists for either doing some kind of direct analyses with that data or for creating proxy measures. It is also for business people to do competition analysis or to find local leads.
The only problem is you have to pay for these. Some of them you pay by usage. Some of them you have to rent by month. After a limited free trial, you have to pay for an Apify subscription, which can go to your usage of the usage-based Apify actors.
Way 5: Firecrawl -> Markdown -> Structured Extraction
A lot of data that you're going to want to get is not going to be highly structured.
For example, when I was working on my EconNow project, I had to scrape lots of economics job market candidate pages.
Each of these pages had their own HTML structure, so the way in which I wanted to scrape them was not to rely on writing individual scrapers for each web page.
Instead, a common technique is to turn the web page into Markdown and then have an LLM, like those of OpenAI, parse the Markdown and create some kind of structured output.
Firecrawl is a paid service which can be used to easily turn web pages into Markdown.
It's also available as an open-source project, but everything I've seen about the open-source offering is that it's shit, so for me there's enough of an ROI that I pay for Firecrawl myself.
But basically, when you have that tool to turn a webpage into Markdown, you can then pass that Markdown to OpenAI by API. If you set up your structured outputs correctly according to how the API expects you to pass structured outputs, you'll be able to get the LLM to parse out particular kinds of fields from these different varying unstructured websites.
Way 6: DIY HTML -> Markdown -> Structured Extraction
Now, you might be asking me: Hey dummy, why are you paying for Firecrawl? Can't you just turn it into Markdown yourself?
And the answer is yeah, you can. There are tools that allow you to do that yourself too.
Here's one: https://github.com/mixmark-io/turndown
Here's another: https://github.com/microsoft/markitdown
The reason I use Firecrawl is because I find it handles certain edge cases better. It's just a very nicely designed service, and those incremental improvements are worth it to me. If you're on a small budget, like you're an academic and you got a $20/month Codex subscription and that's all you're paying for, then definitely don't pay for Firecrawl.
Just use one of these packages instead, or rather just point Claude Code to it and say, "Hey, turn this into Markdown for me, and help me use OpenAI API to extract the data."
Now, another thing to point out is that you don't even need to send for small-scale stuff the Markdown to an external API call. You could just have Claude Code or Codex itself do the structured extraction.
If you're working at a scale of thousands or tens of thousands of documents, it's gonna be a pain in the ass, not something you'll want to do.
Way 7: yt-dlp
yt-dlp is a tool which lets you scrape any YouTube video, its metadata, and the subtitles of the videos.
I basically never watch videos anymore. I'll just download the subtitles and then have Claude Code or Codex create a personalized summary for me, applying the video to whatever context it is that I actually care about.
In this video, I do a live exercise using Claude Code with YTDLP to reverse engineer an AI YouTubers successful videos: https://youtu.be/rxQl4A9-dnk?si=o6PBkrGIBocakuzL
I made that YouTube video as a throwaway, but the resulting product that I created live on that video, I actually do use quite often to help me think about what videos to make and how to ideate my videos.
There is a ton of useful data in YouTube videos, and I really think that this tool in particular is highly under-exploited.
Way 8: Reddit JSON Endpoint
Reddit has a JSON endpoint which can be used to find just about anything on it.
You just add on ".json" to the end of a Reddit URL, and then your agentic coding tool has access to everything on that part of Reddit as a JSON document.
Take a look at this link for example to the Claude Code subreddit's JSON endpoint.
I have some skills set up that I use to basically keep a pulse on what people are talking about on a large variety of subreddits that I care about. Those skills are just Claude Code or Codex hitting the JSON Reddit endpoints.
Way 9: Agent Browser + Credentials
There's a lot of sites that are protected behind some kind of authentication. in order to get past that authentication, you have two possible approaches you can take.
First, you can do that authentication exchange. Then, through the exchange, you will sometimes get a cookie that gets stored on your computer, and then Claude Code can use that cookie to authenticate and see authenticated views.
The other option is to use a tool by Vercel called Agent Browser.
It's a browser automation CLI created by Vercel which is optimized for use by agents.
For small-scale scraping, I've been preferring to use Agent Browser.
For example, you could store your Facebook login credentials somewhere that Claude Code or Codex have access to, either in some online vault to which you have some secure key exchange mechanism, or just yolo it and store it in your environment variables of your terminal or in some .env file.
And then you could create a skill which Claude Code uses to scrape Facebook groups that you are in by logging in with your Facebook login credentials in Agent Browser and then going to the Facebook group, grabbing all the posts, and writing out that data to somewhere you want it written out.
Postscript
I run a community of over 1,300 AI developers, business people, and social scientists trying to get to the cutting edge of agent coding: https://www.skool.com/the-ai-mba
If you're interested in learning more and want to learn in a community, then you should definitely join!
Posted: 2025-11-03T18:45:33.000Z
Engagement: 516 likes, 47 retweets, 7 replies