In many cases, we want to extract data from a website. This is especially true when we are fine-tuning a model for downstream tasks. For example, BERT needs to be fine-tuned on a specific task (e.g., sentiment analysis). In this case, we want to extract the text from a website and feed it into the model for fine-tuning.
GPT-Crawler is a great tool that allows you to scrape websites into JSON, which can be used as input for downstream tasks.
In this blog post, we will walk through how to use GPT-Crawler to scrape websites and extract text from them.
First, clone the GPT-Crawler repository:

```shell
git clone https://github.com/BuilderIO/gpt-crawler
```

Then, edit the config.ts file as below:

```typescript
export const defaultConfig: Config = {
  url: "https://arxiv.org/abs/1810.04805",      // page where the crawl starts
  match: "https://arxiv.org/abs/1810.04805/**", // only follow links matching this glob
  maxPagesToCrawl: 500,                         // stop after this many pages
  outputFileName: "output.json",                // file the results are written to
};
```
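The match glob controls which discovered links the crawler follows. As a sketch of how you might adapt the config to a site with many subpages (the URL below is hypothetical, not from the original post), a documentation crawl could look like:

```typescript
// Hypothetical config: crawl only pages under /docs/ on example.com.
export const defaultConfig: Config = {
  url: "https://example.com/docs/intro", // starting page (hypothetical URL)
  match: "https://example.com/docs/**",  // follow only links under /docs/
  maxPagesToCrawl: 50,                   // keep the crawl small
  outputFileName: "docs.json",
};
```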
Next, run the following commands:

- `npm i` to install dependencies.
- `npx playwright install` to install Playwright.
- `npx playwright install-deps` to install the browser dependencies.
- `npm start` to start crawling.

The scraped content will be written to output.json.

In this blog post, we walked through how to use GPT-Crawler to scrape websites and extract text from them. We also showed you how to customize GPT-Crawler by editing its config file. We hope this article was helpful on your journey towards building AI applications.
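As a follow-up to the steps above, the scraped output.json can be turned into plain-text examples for fine-tuning. The sketch below assumes each entry in the output is an object with `title`, `url`, and `html` fields (check your own output.json for the exact shape); the `extractTexts` helper and file path are illustrative, not part of GPT-Crawler itself.

```typescript
// Sketch: turn GPT-Crawler's output.json into plain-text training examples.
// Assumption: each entry has { title, url, html } fields, where `html` holds
// the extracted page text; adjust to the actual shape of your output file.
import { readFileSync } from "fs";

interface CrawledPage {
  title: string;
  url: string;
  html: string;
}

// Keep only non-empty page bodies, trimmed of surrounding whitespace.
export function extractTexts(pages: CrawledPage[]): string[] {
  return pages
    .map((page) => page.html.trim())
    .filter((text) => text.length > 0);
}

// Example usage (file path is illustrative):
// const pages: CrawledPage[] = JSON.parse(readFileSync("output.json", "utf8"));
// const texts = extractTexts(pages);
```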