HTMLGrabr library
A Node.js library to grab and clean HTML content.
Features
- Extract page content from a URL (HTMLGrabr.grabURL(url: URL))
- Extract page content from a string (HTMLGrabr.grab(s: string))
- Clean the page content:
  - Extract main content using node-readability
  - Extract text content using html2plaintext
  - Remove link or image references to blacklisted sites
  - Remove tracking pixels
  - Remove unwanted attributes (by default: class and id)
- Extract Open Graph properties
Usage
npm install --save htmlgrabr
Then in your code:
const HTMLGrabr = require('htmlgrabr')
const { URL } = require('url')

const grabber = new HTMLGrabr()

grabber.grabUrl(new URL('http://keeper.nunux.org'))
  .then(page => {
    console.log(page)
  }, err => {
    console.log(err)
  })
Variables
Const DefaultFilterChain
DefaultFilterChain: FilterFunc[] = [
  removeAttributes(['id', 'class']),
  removeImageTraker(),
  externalizeLinks(),
  // moveAttribute('data-src', 'src')
]
Const db
db: Set<string> = new Set(['doubleclick.net','feeds.feedburner.com'])
Const h2p
h2p: any = require('html2plaintext')
Const pretty
pretty: any = require('pretty')
Const readability
readability: Function = promisify(require('node-readability'))
Functions
clean
- clean(doc: Document, filters?: FilterFunc[]): void
Parameters
- doc: Document
- Default value filters: FilterFunc[] = DefaultFilterChain
Returns void
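A filter chain can be pictured as a list of functions applied to the document one after another. The sketch below is only an illustration, not the library's implementation: it assumes that a FilterFunc receives the document and mutates it in place, and it uses hypothetical stand-ins (FakeElement, fakeDocument) instead of a real DOM Document so that it runs without dependencies.

```javascript
// Hypothetical stand-in for a DOM element: just enough API for this sketch.
class FakeElement {
  constructor (attrs) { this.attrs = { ...attrs } }
  removeAttribute (name) { delete this.attrs[name] }
}

// Hypothetical stand-in for a Document that only knows how to list its elements.
const fakeDocument = elements => ({ querySelectorAll: () => elements })

// Assumed shape of a filter factory: it returns a function that mutates the document.
function removeAttributes (names) {
  return doc => {
    for (const el of doc.querySelectorAll('*')) {
      for (const name of names) el.removeAttribute(name)
    }
  }
}

// Apply each filter of the chain to the document, in order.
function clean (doc, filters = [removeAttributes(['id', 'class'])]) {
  for (const filter of filters) filter(doc)
}

const link = new FakeElement({ id: 'main', class: 'btn', href: '/home' })
clean(fakeDocument([link]))
console.log(link.attrs) // only href remains
```

Because every filter shares the same shape, a custom chain is just another array passed as the second argument.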
extractBaseUrl
- extractBaseUrl(doc: Document): string | null
Parameters
- doc: Document
Returns string | null
the base URL
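As a rough illustration of the string | null contract, the sketch below assumes that the base URL comes from the first <base href="..."> element of the document. That lookup is an assumption, and the two document objects are hypothetical stand-ins for a real DOM Document.

```javascript
// Assumption: the base URL is taken from the first <base href="..."> element.
function extractBaseUrl (doc) {
  const base = doc.querySelector('base[href]')
  return base ? base.getAttribute('href') : null
}

// Hypothetical stand-ins for a Document, so the sketch runs without a DOM library.
const withBase = {
  querySelector: () => ({ getAttribute: () => 'http://keeper.nunux.org/' })
}
const withoutBase = { querySelector: () => null }

console.log(extractBaseUrl(withBase))
console.log(extractBaseUrl(withoutBase))
```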
extractImages
- extractImages(doc: Document, illustration?: undefined | string): ImageMeta[]
Parameters
- doc: Document
- Optional illustration: undefined | string
Returns ImageMeta[]
array of image metadata
extractOpenGraphProps
Returns the Open Graph properties
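Open Graph properties are declared in <meta property="og:..." content="..."> elements, so collecting them amounts to a loop over those elements. The signature and return shape below are assumptions (a Document in, a plain object keyed by property name without the "og:" prefix out), and fakeMeta and doc are hypothetical stand-ins for real DOM objects.

```javascript
// Assumption: the function receives a Document and returns a plain object
// mapping each Open Graph property (without the "og:" prefix) to its content.
function extractOpenGraphProps (doc) {
  const props = {}
  for (const el of doc.querySelectorAll('meta[property^="og:"]')) {
    const name = el.getAttribute('property').slice('og:'.length)
    const content = el.getAttribute('content')
    if (content !== null) props[name] = content
  }
  return props
}

// Hypothetical stand-in for a <meta> element.
const fakeMeta = (property, content) => {
  const attrs = { property, content }
  return { getAttribute: name => (name in attrs ? attrs[name] : null) }
}

// Hypothetical stand-in for a Document holding two Open Graph <meta> elements.
const doc = {
  querySelectorAll: () => [
    fakeMeta('og:title', 'Example page'),
    fakeMeta('og:type', 'website')
  ]
}

console.log(extractOpenGraphProps(doc))
```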
isBlacklisted
- isBlacklisted(hostname: string): boolean
Parameters
- hostname: string
Returns boolean
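One way to read this check is as a lookup against the db set listed under Variables. The sketch below is a guess at the behavior, not the library's code. In particular, matching subdomains of a blacklisted host (for example ad.doubleclick.net) is an assumption.

```javascript
// Same entries as the db constant documented above.
const db = new Set(['doubleclick.net', 'feeds.feedburner.com'])

// Assumption: a hostname is blacklisted when it, or one of its parent
// domains, is present in the set.
function isBlacklisted (hostname) {
  const labels = hostname.toLowerCase().split('.')
  // Try "a.b.c", then "b.c"; never test the bare top-level domain.
  for (let i = 0; i < labels.length - 1; i++) {
    if (db.has(labels.slice(i).join('.'))) return true
  }
  return false
}

console.log(isBlacklisted('doubleclick.net'))
console.log(isBlacklisted('ad.doubleclick.net'))
console.log(isBlacklisted('keeper.nunux.org'))
```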
Const isValidUrl
- isValidUrl(url: string): boolean
Parameters
- url: string
Returns boolean
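A common way to implement such a check in Node.js is to let the WHATWG URL constructor do the parsing and treat a thrown error as "invalid". The sketch below takes that approach. Whether the library uses it is an assumption.

```javascript
// Assumption: a string is a valid URL when the URL constructor accepts it
// as an absolute URL. The constructor throws a TypeError otherwise.
const isValidUrl = url => {
  try {
    new URL(url) // eslint-disable-line no-new
    return true
  } catch (_) {
    return false
  }
}

console.log(isValidUrl('http://keeper.nunux.org'))
console.log(isValidUrl('not a url'))
```

With this approach, relative references such as /images/logo.png are rejected, because no base URL is supplied.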
moveAttribute
- moveAttribute(attr1: string, attr2: string): FilterFunc
Parameters
- attr1: string
- attr2: string
Returns FilterFunc
the filtering function
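The commented-out moveAttribute('data-src', 'src') entry in DefaultFilterChain hints at a typical use: restoring lazy-loaded images. The sketch below assumes that "moving" means copying the value of attr1 into attr2 and then removing attr1, and that a FilterFunc receives the document. The img and doc objects are hypothetical stand-ins for real DOM objects.

```javascript
// Assumption: "moving" copies the value of attr1 into attr2, then removes attr1.
function moveAttribute (attr1, attr2) {
  return doc => {
    for (const el of doc.querySelectorAll('*')) {
      if (!el.hasAttribute(attr1)) continue
      el.setAttribute(attr2, el.getAttribute(attr1))
      el.removeAttribute(attr1)
    }
  }
}

// Hypothetical stand-in for a lazy-loaded <img> element and its Document.
const attrs = { 'data-src': 'photo.jpg' }
const img = {
  hasAttribute: name => name in attrs,
  getAttribute: name => (name in attrs ? attrs[name] : null),
  setAttribute: (name, value) => { attrs[name] = String(value) },
  removeAttribute: name => { delete attrs[name] }
}
const doc = { querySelectorAll: () => [img] }

moveAttribute('data-src', 'src')(doc)
console.log(attrs) // the value now lives under "src"
```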
rebaseSrcAttribute
Returns the filtering function
removeAttributes
Returns the filtering function
removeBlacklistedLinks
Returns the filtering function
Object literals
Const DefaultConfig
DefaultConfig: object
debug
debug: false = false
headers
headers: Headers = new Headers({'User-Agent': 'Mozilla/5.0 (compatible; HTMLGrabr/1.0)',})
clean: Clean a DOM using a filter chain.