Blog posts are a key resource for learning new technologies across the web. With OpenAI’s GPT, we can
take this a step further by summarizing blog posts into a few sentences, allowing quicker consumption while retaining
key concepts.
In this article, we will learn how to scrape blog post content and summarize it using OpenAI’s GPT via TypeScript.
Setup
Before getting started, please ensure you have an OpenAI account. If not, you can sign up on their website.
Once signed up, take note of your API Key as we’ll need it later.
This project uses Node.js 18.x. You can check your Node.js version by running node -v within the terminal.
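Node.js 18 matters here because it is the first release line in which the Fetch API used later in this article is available globally without a flag. As a rough sketch, a script could guard against older runtimes with a check like the one below. The helper name isSupportedNode is hypothetical and not part of the original project.

```typescript
// Hypothetical helper (not from the original project): checks a Node.js
// version string such as "v18.16.0" against a minimum major version.
function isSupportedNode(version: string, minMajor: number = 18): boolean {
  // parseInt stops at the first non-digit, so "18.16.0" yields 18.
  const major = parseInt(version.replace(/^v/, ''), 10);
  return !isNaN(major) && major >= minMajor;
}
```

Inside a script this could be called as isSupportedNode(process.version), exiting early with a helpful message when it returns false.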
Let’s begin by initializing our project and installing dependencies:
# Create the project directory and navigate into it
mkdir summarize-blog-posts-with-typescript-and-openais-gpt && cd $_
# Initialize project
npm init -y && npm pkg set type=module
# Install dependencies (openai is pinned to v3, which exports the Configuration and OpenAIApi classes used below)
npm i openai@3 gpt-3-encoder cheerio user-agents dotenv typescript @types/node @types/user-agents
Now that we have our project set up, let’s create a .env file and place our secret OpenAI API Key in there.
.env $el.removeAttribute('data-checked'), 2500);
" type="button" class="inline-flex items-center justify-center gap-2 whitespace-nowrap rounded focusable disabled:opacity-50 disabled:pointer-events-none motion-safe:transition-colors aria-disabled:opacity-50 aria-disabled:pointer-events-none bg-transparent text-accent-11 border-none hover:bg-accent-3 hover:text-accent-12 size-8 group absolute top-0.5 right-0.5"> OPENAI_API_KEY = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
Next, create a tsconfig.json file at the root of our project. This will allow us to leverage TypeScript within our
project.
You can read about each of these options in more detail in the TypeScript documentation.
tsconfig.json
{
  "compilerOptions": {
    "target": "es2022",
    "module": "es2022",
    "esModuleInterop": true,
    "forceConsistentCasingInFileNames": true,
    "strict": true,
    "skipLibCheck": true,
    "moduleResolution": "node",
    "noEmit": true,
    "allowImportingTsExtensions": true
  },
  "ts-node": { "esm": true },
  "include": ["**/*.ts"],
  "exclude": ["node_modules"]
}
Lastly, create an index.ts file at the root of our project. Within this file, we can load our .env file and
initialize the OpenAI API client.
index.ts
import dotenv from 'dotenv';
import { Configuration, OpenAIApi } from 'openai';

// Load variables from .env into process.env
dotenv.config();

// Initialize the OpenAI API client with our secret key
const configuration = new Configuration({ apiKey: process.env.OPENAI_API_KEY });
const openai = new OpenAIApi(configuration);
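If the .env file is missing or the variable name is misspelled, process.env.OPENAI_API_KEY is simply undefined and the failure only surfaces later as an authentication error. One possible guard is sketched below. The helper requireEnv is hypothetical and not part of the original code.

```typescript
// Hypothetical helper (not part of the original code): reads a required
// environment variable and fails fast with a clear message when it is
// missing or blank.
function requireEnv(name: string, env: Record<string, string | undefined>): string {
  const value = env[name]?.trim();
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}
```

It could be used as requireEnv('OPENAI_API_KEY', process.env) after dotenv.config() has run, with the returned string passed as apiKey.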
Web Scraping Content
In order to summarize a blog post, we must first capture its contents. Start by using the Fetch API to get
all HTML content from the blog post’s webpage.
We must also pass a User-Agent header with the request, as certain websites block requests that lack one. We
can generate one via the user-agents library.
index.ts
import dotenv from 'dotenv';
import { Configuration, OpenAIApi } from 'openai';
import UserAgent from 'user-agents';

// Load variables from .env into process.env
dotenv.config();

// Initialize the OpenAI API client with our secret key
const configuration = new Configuration({ apiKey: process.env.OPENAI_API_KEY });
const openai = new OpenAIApi(configuration);

// Generate a Chrome-like User-Agent and fetch the page's HTML
const headers = { 'User-Agent': new UserAgent(/Chrome/).toString() };
const url = 'https://openai.com/blog/introducing-chatgpt-and-whisper-apis';
const html = await fetch(url, { headers }).then((res) => res.text());
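Note that fetch resolves even when the server answers with an error status such as 403 or 404, so an error page could end up being summarized. A minimal sketch of a wrapper that rejects on non-2xx responses follows. The names fetchHtml, FetchLike, and HtmlResponse are hypothetical, and the fetch function is passed in as a parameter so the sketch does not depend on a particular implementation.

```typescript
// Minimal structural types covering only what the sketch needs; the global
// fetch in Node.js 18 should satisfy them.
interface HtmlResponse {
  ok: boolean;
  status: number;
  text(): Promise<string>;
}
type FetchLike = (
  url: string,
  init: { headers: Record<string, string> },
) => Promise<HtmlResponse>;

// Hypothetical helper: fetches a page and rejects on non-2xx responses.
async function fetchHtml(
  fetchFn: FetchLike,
  url: string,
  headers: Record<string, string>,
): Promise<string> {
  const res = await fetchFn(url, { headers });
  if (!res.ok) {
    throw new Error(`Request to ${url} failed with status ${res.status}`);
  }
  return res.text();
}
```

With this in place, the html constant could be obtained via await fetchHtml(fetch, url, headers) instead of calling fetch directly.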
Next, using Cheerio, we extract the blog post content from the HTML. This is essential as we only want
to summarize the blog post content and not the entire page.
Web scraping can be tricky as each website is built differently. You may need to adjust the selectors below to fit your
needs.
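One way to make the selection less dependent on a single page layout is to try a list of candidate selectors in order and use the first one that matches. The sketch below is library-agnostic: the caller supplies a function that counts matches, which with Cheerio could be (selector) => $(selector).length. The helper name and the candidate list are illustrative assumptions, not taken from any particular site.

```typescript
// Hypothetical helper: returns the first selector that matches at least one
// element, or undefined when none of them match.
function pickFirstMatching(
  selectors: string[],
  countMatches: (selector: string) => number,
): string | undefined {
  for (const selector of selectors) {
    if (countMatches(selector) > 0) {
      return selector;
    }
  }
  return undefined;
}

// Example candidates, ordered from most to least specific. These are
// illustrative guesses only.
const candidateSelectors = ['article p', 'main p', 'body p'];
```

With Cheerio, the result could be used as $(pickFirstMatching(candidateSelectors, (s) => $(s).length) ?? 'body p').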
Also, OpenAI’s GPT models have a maximum number of tokens they can accept, so we must truncate
our content to fit within this limit. Thankfully, we can leverage gpt-3-encoder to do all the heavy
lifting.
For this example, we will be capturing the first 8000 tokens. The max token count will vary based on the GPT model
you’re using. I recommend using GPT-4 as it has a much higher limit.
index.ts
import { load } from 'cheerio';
import dotenv from 'dotenv';
import { decode, encode } from 'gpt-3-encoder';
import { Configuration, OpenAIApi } from 'openai';
import UserAgent from 'user-agents';

// Load variables from .env into process.env
dotenv.config();

// Initialize the OpenAI API client with our secret key
const configuration = new Configuration({ apiKey: process.env.OPENAI_API_KEY });
const openai = new OpenAIApi(configuration);

// Generate a Chrome-like User-Agent and fetch the page's HTML
const headers = { 'User-Agent': new UserAgent(/Chrome/).toString() };
const url = 'https://openai.com/blog/introducing-chatgpt-and-whisper-apis';
const html = await fetch(url, { headers }).then((res) => res.text());

// Parse the HTML, drop non-article elements, and collect the paragraphs
const $ = load(html);
$('header, footer, aside, noscript').remove();
const content = $('main').length > 0 ? $('main p') : $('body p');

// Keep only the first 8000 tokens of the extracted text
const safeContent = decode(encode(content.text()).slice(0, 8000));
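The 8000 figure above is a fixed cut-off. Because a model's token limit covers the prompt and the generated reply together, another option is to derive the cut-off from the model's context window and leave headroom for the summary. The sketch below assumes an 8,192-token window (the figure published for the original gpt-4 model; check the documentation for the model you use) and made-up reserve sizes. Both helper names are hypothetical. Also keep in mind that gpt-3-encoder implements the older GPT-3 tokenizer, so its counts are only an approximation for newer chat models.

```typescript
// Hypothetical helper: how many tokens of blog content fit once room is
// reserved for the instruction text and for the model's reply.
function contentTokenBudget(
  contextWindow: number,
  promptOverhead: number,
  completionReserve: number,
): number {
  return Math.max(0, contextWindow - promptOverhead - completionReserve);
}

// Hypothetical helper: keeps at most `budget` tokens from the front.
function truncateTokens(tokens: number[], budget: number): number[] {
  return tokens.slice(0, budget);
}

// Assumed values for illustration only: 8192 - 64 - 512 = 7616.
const budget = contentTokenBudget(8192, 64, 512);
```

The truncation line could then read decode(truncateTokens(encode(content.text()), budget)).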
Summarizing Content
Finally, we can pass our content to OpenAI’s GPT and receive a summary of the blog post. Keep in mind, the request may
take a few seconds to complete.
index.ts
import { load } from 'cheerio';
import dotenv from 'dotenv';
import { decode, encode } from 'gpt-3-encoder';
import { Configuration, OpenAIApi } from 'openai';
import UserAgent from 'user-agents';

// Load variables from .env into process.env
dotenv.config();

// Initialize the OpenAI API client with our secret key
const configuration = new Configuration({ apiKey: process.env.OPENAI_API_KEY });
const openai = new OpenAIApi(configuration);

// Generate a Chrome-like User-Agent and fetch the page's HTML
const headers = { 'User-Agent': new UserAgent(/Chrome/).toString() };
const url = 'https://openai.com/blog/introducing-chatgpt-and-whisper-apis';
const html = await fetch(url, { headers }).then((res) => res.text());

// Parse the HTML, drop non-article elements, and collect the paragraphs
const $ = load(html);
$('header, footer, aside, noscript').remove();
const content = $('main').length > 0 ? $('main p') : $('body p');

// Keep only the first 8000 tokens of the extracted text
const safeContent = decode(encode(content.text()).slice(0, 8000));

// Ask the model for a one-paragraph summary
const chatCompletion = await openai.createChatCompletion({
  model: 'gpt-4', // or gpt-3.5-turbo
  messages: [{ role: 'user', content: `Summarize in 1 paragraph: ${safeContent}` }],
});
const summary = chatCompletion.data.choices[0].message?.content;
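Because of the optional chaining on message?.content, summary can be undefined, for example when a choice carries no message. A small sketch of a helper that turns that case into an explicit error is shown below. The name extractSummary and the ChatCompletionLike type are hypothetical; the type lists only the fields read here, whereas the real response object contains more.

```typescript
// Minimal structural type covering only the fields read below.
interface ChatCompletionLike {
  choices: Array<{ message?: { content?: string | null } }>;
}

// Hypothetical helper: returns the trimmed summary text, or throws when the
// response carries no usable content.
function extractSummary(completion: ChatCompletionLike): string {
  const text = completion.choices[0]?.message?.content?.trim();
  if (!text) {
    throw new Error('The model returned no summary text');
  }
  return text;
}
```

In the script above this could replace the last line with const summary = extractSummary(chatCompletion.data), assuming the response shape matches.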
Here’s an example of what the summary may look like:
---
"OpenAI now offers developers access to ChatGPT and Whisper models through its
API, enabling the integration of cutting-edge language and speech-to-text
capabilities into their apps and products. Over time, OpenAI has achieved a 90%
cost reduction for ChatGPT and will pass on these savings to API users. The
Whisper large-v2 model is also available through the API for faster and
cost-effective results. Users can expect continuous model improvements and
access to dedicated capacity for deeper control over the models. Clients like
Snap Inc., Quizlet, Instacart, Shopify, and Speak are already taking advantage
of these APIs to create AI-powered solutions. OpenAI has also made changes to
its API terms of service in response to developer feedback."
---
In very few lines of code, we were able to pull all content from a blog post and present a detailed summary of that
content. As mentioned earlier, this allows quick consumption while retaining key concepts.
You can see this in action on my personal project feedjoy. Under each blog post is a summary of its
content.