NOMUD Dev Blog

Use GPT-Crawler to scrape websites into JSON

Published in AI Dev
August 01, 2024

Table of Contents

1. Install GPT-Crawler
2. Conclusion

In many cases, we want to extract data from a website. This is especially true when we are fine-tuning a model for downstream tasks. For example, to adapt BERT to a specific task (e.g., sentiment analysis), we need task-relevant text, and one way to get it is to extract text from websites and feed it into the model during fine-tuning.

GPT-Crawler is a great tool that allows you to scrape websites into JSON, which can be used as input for downstream tasks.

In this blog post, we will walk through how to use GPT-Crawler to scrape websites and extract text from them.
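GPT-Crawler's output is a JSON array with one entry per crawled page. As a rough illustration, each entry looks something like the following; the field names (title, url, html) match the crawler's default output at the time of writing, but treat this as an illustrative shape rather than a guaranteed contract:

[
  {
    "title": "Example page title",
    "url": "https://example.com/page",
    "html": "The extracted text content of the page..."
  }
]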

Install GPT-Crawler

  1. Clone the repository: git clone https://github.com/BuilderIO/gpt-crawler
  2. Edit the config.ts file as below:
import { Config } from "./src/config"; // the Config type ships with the gpt-crawler repo

export const defaultConfig: Config = {
  url: "https://arxiv.org/abs/1810.04805", // page to start crawling from
  match: "https://arxiv.org/abs/1810.04805/**", // links to follow must match this pattern
  maxPagesToCrawl: 500, // upper bound on the number of pages visited
  outputFileName: "output.json", // where the scraped data is written
};
  3. Run npm i to install dependencies.
  4. Run npx playwright install to install the Playwright browsers.
  5. Run npx playwright install-deps to install the browser system dependencies.
  6. Run npm start to start crawling.
  7. The output will be saved in a file called output.json; see the sketch after this list for a quick way to read it.
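As a quick sanity check, here is a minimal Node.js/TypeScript sketch that reads output.json and collects the scraped text, e.g. to build a fine-tuning corpus. It assumes the output shape shown earlier (title, url, and html fields); adjust the field names if your version of GPT-Crawler emits something different:

import { readFileSync, writeFileSync } from "fs";

// Shape of one crawled page, assuming GPT-Crawler's default output format.
interface CrawledPage {
  title: string;
  url: string;
  html: string; // extracted text content of the page
}

// Load and parse the crawler's output.
const pages: CrawledPage[] = JSON.parse(readFileSync("output.json", "utf-8"));
console.log(`Crawled ${pages.length} pages`);

// Keep only non-empty text and join it into one plain-text corpus.
const corpus = pages
  .map((page) => page.html.trim())
  .filter((text) => text.length > 0)
  .join("\n\n");

writeFileSync("corpus.txt", corpus);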

Conclusion

In this blog post, we walked through how to use GPT-Crawler to scrape websites and extract text from them, and showed how to customize its behavior by editing the config file. We hope this article is helpful on your journey toward building AI applications.


Tags

#gptcrawler #crawler
