Web scraper lambda
AWS SAM application and client for scraping website og meta tags and content via css selectors.
Prerequisites
- Git
- AWS SAM Cli
Lambda deploy
$ git clone https://github.com/68publishers/web-scraper-lambda.git
$ cd web-scraper-lambda
$ sam build
$ sam deploy --guided
Client installation
The first option is to download the client as a module.
$ npm i --save @68publishers/web-scraper-client
# or
$ yarn add @68publishers/web-scraper-client
And import it in your project.
import WebScraperClient from '@68publishers/web-scraper-client';
// or
const WebScraperClient = require('@68publishers/web-scraper-client');
Or you can import the client into the browser from the CDN
<script src="https://unpkg.com/@68publishers/web-scraper-client/dist/web-scraper-client.min.js"></script>
Client usage
The client must be initialized with the URL of your lambda function and an optional configuration object.
var client = new WebScraperClient(
'https://<gateway>.execute-api.<region>.amazonaws.com/<stage>/scrap',
{} // optional configuration
);
Optional configuration values table:
Option path | Type | Default | Description |
---|---|---|---|
cache.storage |
null or storage |
null |
Pass localStorage or sessionStorage or any compatible storage for enabled caching. |
cache.ttl |
int |
3600 |
Cache expiration in seconds. |
cache.prefix |
string |
"web-scraper-cache:" |
Prefix for cache item keys. |
To scrap data from a web page, call the scrap
method with the desired URL. You can use the second optional queries
argument to retrieve additional data.
The value of the argument should be an object whose keys are arbitrary names and whose values are CSS selectors, such as #main > .header > .title
.
If you need an attribute value, add @attributeName
to the end of the selector, for example #gallery > img @src
.
// get only og meta tags
client.scrap('https://wwww.website-to-scrap.com/test')
.then(response => {
// do anything with parsed response
})
.catch(e => {
// whoops
});
// get og meta tags and some additional data
client.scrap(
'https://wwww.website-to-scrap.com/test',
{
pageLinks: "a @href",
galleryImages: "#product_gallery img @src",
productName: "#main .product-card > .product-name",
}
).then(response => {
// do anything with parsed response
}).catch(e => {
// whoops
});
Response object
The response object contains all parsed meta tags and "queries".
client.scrap(/*...*/).then(response => {
var url = response.requestUrl; // url from which the data was scraped
var allMeta = response.meta(); // returns all found og meta tags
var ogTitle = response.meta('ogTitle', ''); // return the specific meta tag, the second argument is the default value
var pageLinks = response.queryValues('pageLinks', []) // return all found page links
var galleryImages = response.queryValues('galleryImages', []) // return all gallery images
var productName = response.queryValue('productName', 'Unknown product'); // the method `queryValue` returns the first value in an array
var productNameError = response.queryError('productName'); // the method `queryError` returns an error message (for example if passed css selector is invalid) or false
});
Response caching
The cache must be enabled in the client configuration.
var client = new WebScraperClient(
'https://<gateway>.execute-api.<region>.amazonaws.com/<stage>/scrap',
{
cache: {
storage: window.sessionStorage, // or window.localStorage
ttl: 3600, // expiration in seconds
prefix: 'web-scraper-cache:', // prefix for cache keys
},
}
);