HTMLGrabr library

A Node.js library to grab and clean HTML content.

Features

  • Extract page content from a URL (HTMLGrabr.grabUrl(url: URL))
  • Extract page content from a string (HTMLGrabr.grab(s: string))
  • Clean the page content:
    • Extract main content using node-readability
    • Extract text content using html2plaintext
    • Remove link or image references of blacklisted sites
    • Remove tracking pixels
    • Remove unwanted attributes (by default: class and id)
  • Extract Open Graph properties

Usage

npm install --save htmlgrabr

Then in your code:

const HTMLGrabr = require('htmlgrabr')
const { URL } = require('url')

const grabber = new HTMLGrabr()

grabber.grabUrl(new URL('http://keeper.nunux.org'))
  .then(page => {
    console.log(page)
  }, err => {
    console.log(err)
  })

Index

Variables

Const DefaultFilterChain

DefaultFilterChain: FilterFunc[] = [
  removeAttributes(['id', 'class']),
  removeImageTraker(),
  externalizeLinks(),
  // moveAttribute('data-src', 'src')
]

Const db

db: Set<string> = new Set(['doubleclick.net', 'feeds.feedburner.com'])

Const h2p

h2p: any = require('html2plaintext')

Const pretty

pretty: any = require('pretty')

Const readability

readability: Function = promisify(require('node-readability'))

Functions

clean

  • clean(doc: Document, props: CleanupProps, filters?: FilterFunc[]): void
  • Clean a DOM using a filter chain.

    Parameters

    • doc: Document

      DOM to clean

    • props: CleanupProps

      properties used by filters

    • Default value filters: FilterFunc[] = DefaultFilterChain

      filter chain to apply

    Returns void
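
The filter-chain idea behind clean can be sketched without the library. This is a minimal illustration, not the library's implementation: it assumes a filter has the shape (node, props) => void and mutates the node in place, and it uses a hand-written stand-in for the document. The names cleanSketch, fakeDoc, dropId, rebaseSrc and the baseURL property are all hypothetical.

```javascript
// Hypothetical sketch: apply every filter of the chain to every node.
// Assumption: a filter is (node, props) => void and mutates the node.
function cleanSketch (doc, props, filters) {
  for (const node of doc.querySelectorAll('*')) {
    for (const filter of filters) {
      filter(node, props)
    }
  }
}

// Stand-in for a DOM: two plain objects instead of real elements.
const nodes = [
  { tag: 'a', attrs: { id: 'x', href: '/a' } },
  { tag: 'img', attrs: { class: 'c', src: 'p.png' } }
]
const fakeDoc = { querySelectorAll: () => nodes }

// Two illustrative filters: one ignores props, one reads props.baseURL.
const dropId = node => { delete node.attrs.id }
const rebaseSrc = (node, props) => {
  if (node.attrs.src) {
    node.attrs.src = new URL(node.attrs.src, props.baseURL).toString()
  }
}

cleanSketch(fakeDoc, { baseURL: 'http://keeper.nunux.org/' }, [dropId, rebaseSrc])
console.log(nodes)
```

Passing the chain as an argument keeps each filter small and lets callers add or drop steps, which matches the role of the filters parameter defaulting to DefaultFilterChain.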

externalizeLinks

extractBaseUrl

  • extractBaseUrl(doc: Document): string | null
  • Extract base URL from headers of a DOM.

    Parameters

    • doc: Document

      DOM to process

    Returns string | null

    the base URL

extractImages

  • extractImages(doc: Document, illustration?: undefined | string): ImageMeta[]
  • Extract all images from a DOM.

    Parameters

    • doc: Document

      DOM to process

    • Optional illustration: undefined | string

      if provided, the illustration is added to the result

    Returns ImageMeta[]

    array of image meta data

extractOpenGraphProps

  • extractOpenGraphProps(doc: Document): OpenGraphProps
  • Extract Open Graph properties from headers of a DOM.

    Parameters

    • doc: Document

      DOM to process

    Returns OpenGraphProps

    Open Graph properties

isBlacklisted

  • isBlacklisted(hostname: string): boolean
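
A hostname check against a set such as db might look like the sketch below. It is an assumption, not the library's code: in particular, matching parent domains (so that ad.doubleclick.net is caught by doubleclick.net) is a design choice made here for illustration, and the names blacklist and isBlacklistedSketch are hypothetical.

```javascript
// Hypothetical sketch: not the library's actual implementation.
const blacklist = new Set(['doubleclick.net', 'feeds.feedburner.com'])

// Returns true when the hostname, or one of its parent domains
// (down to two labels), is present in the set.
function isBlacklistedSketch (hostname) {
  const labels = hostname.toLowerCase().split('.')
  for (let i = 0; i < labels.length - 1; i++) {
    if (blacklist.has(labels.slice(i).join('.'))) return true
  }
  return false
}

console.log(isBlacklistedSketch('ad.doubleclick.net'))
console.log(isBlacklistedSketch('keeper.nunux.org'))
```

Comparing whole labels rather than using a substring test avoids false positives on hostnames that merely contain a blacklisted name.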

Const isValidUrl

  • isValidUrl(url: string): boolean
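
One common way to validate a URL string in Node.js is to let the built-in URL constructor parse it. The sketch below shows that approach. Whether the library does the same is not stated here, and restricting the result to http and https is an assumption of this example. The name isValidUrlSketch is hypothetical.

```javascript
// Hypothetical sketch: validate by parsing with the WHATWG URL class.
// Assumption: only http: and https: URLs count as valid.
function isValidUrlSketch (url) {
  try {
    const parsed = new URL(url)
    return parsed.protocol === 'http:' || parsed.protocol === 'https:'
  } catch (err) {
    // The constructor throws a TypeError on unparsable input.
    return false
  }
}

console.log(isValidUrlSketch('http://keeper.nunux.org'))
console.log(isValidUrlSketch('/relative/path'))
```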

moveAttribute

  • moveAttribute(attr1: string, attr2: string): FilterFunc
  • Move the value of one attribute to another attribute.

    Parameters

    • attr1: string

      source attribute to remove

    • attr2: string

      target attribute

    Returns FilterFunc

    the filtering function
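
To illustrate the factory pattern, here is a minimal sketch of a filter that moves data-src to src, as in the commented-out entry of DefaultFilterChain. It assumes a filter receives a single element and mutates it; the real FilterFunc signature may differ. moveAttributeSketch and fakeImage are hypothetical names, and fakeImage is only a stand-in for a DOM element.

```javascript
// Hypothetical sketch: build a filter that copies attr1 into attr2,
// then removes attr1. Elements without attr1 are left untouched.
function moveAttributeSketch (attr1, attr2) {
  return function (el) {
    if (el.hasAttribute(attr1)) {
      el.setAttribute(attr2, el.getAttribute(attr1))
      el.removeAttribute(attr1)
    }
    return el
  }
}

// Minimal stand-in for a DOM element, exposing only the methods used above.
function fakeImage (attrs) {
  const map = new Map(Object.entries(attrs))
  return {
    hasAttribute: name => map.has(name),
    getAttribute: name => (map.has(name) ? map.get(name) : null),
    setAttribute: (name, value) => { map.set(name, String(value)) },
    removeAttribute: name => { map.delete(name) }
  }
}

const img = fakeImage({ 'data-src': '/img/photo.jpg', alt: 'A photo' })
moveAttributeSketch('data-src', 'src')(img)
console.log(img.getAttribute('src'))
```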

rebaseSrcAttribute

  • rebaseSrcAttribute(baseURL: string): FilterFunc
  • Rewrite relative links as absolute URLs.

    Parameters

    • baseURL: string

      the base URL used to make the link absolute

    Returns FilterFunc

    the filtering function
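
The core of such a rebasing step can be sketched with the URL constructor's second argument, which resolves a relative reference against a base. This is an illustration under assumptions, not the library's code. The helper name toAbsolute is hypothetical, and returning the input unchanged on a parse error is a choice made for this example.

```javascript
// Hypothetical helper: resolve a possibly relative reference against baseURL.
// Already-absolute values pass through unchanged; if parsing fails (for
// example because baseURL is not a valid URL), the input is returned as-is.
function toAbsolute (value, baseURL) {
  try {
    return new URL(value, baseURL).toString()
  } catch (err) {
    return value
  }
}

const base = 'http://keeper.nunux.org/docs/index.html'
console.log(toAbsolute('/img/logo.png', base))
console.log(toAbsolute('pic.jpg', base))
console.log(toAbsolute('https://example.org/a.png', base))
```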

removeAttributes

  • removeAttributes(blacklist: string[]): FilterFunc
  • Remove all blacklisted attributes from an HTML element.

    Parameters

    • blacklist: string[]

      the list of attributes to remove

    Returns FilterFunc

    the filtering function
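
A filter factory of this kind can be sketched as a closure over the blacklist. The example assumes a filter receives one element and mutates it, which may not match the real FilterFunc signature. removeAttributesSketch and fakeLink are hypothetical names, and fakeLink is only a stand-in for a DOM element.

```javascript
// Hypothetical sketch: build a filter that strips the listed attributes.
function removeAttributesSketch (blacklist) {
  return function (el) {
    for (const name of blacklist) {
      // Removing an attribute that is absent is a no-op.
      el.removeAttribute(name)
    }
    return el
  }
}

// Minimal stand-in for a DOM element, exposing only the methods used above.
function fakeLink (attrs) {
  const map = new Map(Object.entries(attrs))
  return {
    hasAttribute: name => map.has(name),
    getAttribute: name => (map.has(name) ? map.get(name) : null),
    removeAttribute: name => { map.delete(name) }
  }
}

const link = fakeLink({ id: 'main', class: 'post', href: '/about' })
removeAttributesSketch(['id', 'class'])(link)
console.log(link.hasAttribute('id'), link.getAttribute('href'))
```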

removeBlacklistedLinks

removeImageTraker

Object literals

Const DefaultConfig

DefaultConfig: object

debug

debug: false = false

headers

headers: Headers = new Headers({ 'User-Agent': 'Mozilla/5.0 (compatible; HTMLGrabr/1.0)' })

isBacklisted

isBacklisted: isBlacklisted = isBlacklisted

Generated using TypeDoc