Generic web-scraper for eCommerce websites

  • Tico
    Lox Guru
    • 31.08.2016
    • 1035

    #1

    Generic web-scraper for eCommerce websites

    I'm curious whether anyone has employed a web scraper on the LoxBerry. By this I mean extracting data beyond what a normal Loxone Virtual HTTP Input can handle.

    My goal is to extract the price of my favourite coffee capsules and send an email alert/notification through Loxone if they go on sale. My automated house runs on caffeine before electrons...

    I've analysed the online shop via 'view page source', but the price is completely obfuscated.

    I found the JavaScript snippet at the link below that pulls the price from any given web page. It does this surprisingly successfully.

    https://www.scrapehero.com/how-to-sc...merce-website/

    Installing Puppeteer and running the script in headless mode seems a viable option on the LoxBerry. Unfortunately, it's also beyond my meagre skills.
    I don't speak German. Blame Google Translate if I'm unintelligible.
  • Christian Fenzl
    Living Forum Legend
    • 31.08.2015
    • 11244

    #2
    There is a tutorial for Node.js as well. Node.js is installed on LoxBerry 2.x.
    Extracting a webpage sounds generic, but website providers take several measures to obfuscate their content, beginning with minified JavaScript and ending with special auth mechanisms and session cookies, to force the page to be viewed in a real web browser.

    The so-called “generic” tutorial is in fact a very “specialized” tutorial for collecting the data of that specific example.
    If the page provider's goal isn't obfuscation, you can usually fetch the data directly from the HTML, or from a JSON or XML response requested by the browser.
    Help for the people of Ukraine: https://www.loxforum.com/forum/proje...Cr-die-ukraine


    • Tico
      Lox Guru
      • 31.08.2016
      • 1035

      #3
      So if I don't succeed with the path they propose, I might need to engage ScrapeHero's business for my needs... very clever of them! (sarcasm...)

      I did think it was a bit strange that a data scraping company was so forthcoming with self-defeating guidance.

      Notwithstanding, I did manage to use their script to get my coffee capsule price (with a minor tweak) -


      [Attached image: Web scraper.png]


      But as you say, this might be a very different challenge from a non-browser based intercept.


      • Tico
        Lox Guru
        • 31.08.2016
        • 1035

        #4
        I followed the tutorial for Puppeteer with Node.js. After a few errors and a bit of research, the LoxBerry is successfully downloading hotel names from Booking.com as per their example script.

        https://www.scrapehero.com/how-to-bu...r-and-node-js/

        A problem with their tutorial is that the installation of Puppeteer downloads its own version of Chromium, and that particular version is not compiled for ARM devices.

        The correct way on the LoxBerry is to install the Chromium browser first, then puppeteer-core.

        Code:
        sudo apt-get install chromium-browser chromium-codecs-ffmpeg
        Code:
        npm install puppeteer-core@v1.11.0
        Then change the initial part of the script to use Puppeteer-core and point to the Chromium browser.


        Working hotel scraping script in Puppeteer -
        Code:
        const puppeteer = require('puppeteer-core');

        let bookingUrl = 'https://www.booking.com/searchresults.en-gb.html......truncated';

        (async () => {
            const browser = await puppeteer.launch({
                executablePath: '/usr/bin/chromium-browser',
                headless: true
            });
            const page = await browser.newPage();
            await page.setViewport({ width: 1920, height: 926 });
            await page.goto(bookingUrl);

            // get hotel details
            let hotelData = await page.evaluate(() => {
                let hotels = [];
                // get the hotel elements
                let hotelsElms = document.querySelectorAll('div.sr_property_block[data-hotelid]');
                // get the hotel data
                hotelsElms.forEach((hotelelement) => {
                    let hotelJson = {};
                    try {
                        hotelJson.name = hotelelement.querySelector('span.sr-hotel__name').innerText;
                        hotelJson.reviews = hotelelement.querySelector('span.review-score-widget__subtext').innerText;
                        hotelJson.rating = hotelelement.querySelector('span.review-score-badge').innerText;
                        if (hotelelement.querySelector('strong.price')) {
                            hotelJson.price = hotelelement.querySelector('strong.price').innerText;
                        }
                    } catch (exception) {
                        // skip hotels with missing fields
                    }
                    hotels.push(hotelJson);
                });
                return hotels;
            });

            console.dir(hotelData);
        })();


        But now I'm stuck....

        I can successfully copy/paste the coffee scraping script into the console in Chrome browser and see a result.

        I can successfully run the Loxberry Puppeteer installation with the example hotel scraping script.

        I'm having no luck figuring out where to cut out the hotel-specific part of the script and replace it with the coffee script in Puppeteer.

        I've tried the most obvious location, at line 16 below the comment // get the hotel elements. That returns an error, as pictured below.


        Working coffee scraping script in Chrome developer console -
        Code:
        let elements = [
            ...document.querySelectorAll('body *')
        ]

        function createRecordFromElement(element) {
            const text = element.textContent.trim()
            var record = {}
            const bBox = element.getBoundingClientRect()

            if (text.length <= 30 && !(bBox.x == 0 && bBox.y == 0)) {
                record['fontSize'] = parseInt(getComputedStyle(element)['fontSize'])
            }
            record['y'] = bBox.y
            record['x'] = bBox.x
            record['text'] = text
            return record
        }
        let records = elements.map(createRecordFromElement)

        function canBePrice(record) {
            if (record['y'] > 600 ||
                record['fontSize'] == undefined ||
                !record['text'].match(/(^(US ){0,1}(rs\.|Rs\.|RS\.|\$|₹|INR|USD|CAD|C\$){0,1}(\s){0,1}[\d,]+(\.\d+){0,1}(\s){0,1}(AED){0,1}$)/)
            )
                return false
            else return true
        }

        let possiblePriceRecords = records.filter(canBePrice)
        let priceRecordsSortedByFontSize = possiblePriceRecords.sort(function(a, b) {
            if (a['fontSize'] == b['fontSize']) return a['y'] > b['y']
            return a['fontSize'] < b['fontSize']
        })
        console.log(priceRecordsSortedByFontSize[0]['text']);
        console.log(priceRecordsSortedByFontSize[1]['text']);



        Non-working coffee scraping script in Puppeteer -
        Code:
        const puppeteer = require('puppeteer-core');

        let bookingUrl = 'https://shop.coles.com.au/a/dianella/product/moccona-coffee-capsules-espresso-7';

        (async () => {
            const browser = await puppeteer.launch({
                executablePath: '/usr/bin/chromium-browser',
                headless: true
            });
            const page = await browser.newPage();
            await page.setViewport({ width: 1920, height: 926 });
            await page.goto(bookingUrl);

            // get hotel details
            let hotelData = await page.evaluate(() => {
                let hotels = [];
                // get the hotel elements
                let elements = [
                    ...document.querySelectorAll('body *')
                ]

                function createRecordFromElement(element) {
                    const text = element.textContent.trim()
                    var record = {}
                    const bBox = element.getBoundingClientRect()

                    if (text.length <= 30 && !(bBox.x == 0 && bBox.y == 0)) {
                        record['fontSize'] = parseInt(getComputedStyle(element)['fontSize'])
                    }
                    record['y'] = bBox.y
                    record['x'] = bBox.x
                    record['text'] = text
                    return record
                }
                let records = elements.map(createRecordFromElement)

                function canBePrice(record) {
                    if (record['y'] > 600 ||
                        record['fontSize'] == undefined ||
                        !record['text'].match(/(^(US ){0,1}(rs\.|Rs\.|RS\.|\$|₹|INR|USD|CAD|C\$){0,1}(\s){0,1}[\d,]+(\.\d+){0,1}(\s){0,1}(AED){0,1}$)/)
                    )
                        return false
                    else return true
                }

                let possiblePriceRecords = records.filter(canBePrice)
                let priceRecordsSortedByFontSize = possiblePriceRecords.sort(function(a, b) {
                    if (a['fontSize'] == b['fontSize']) return a['y'] > b['y']
                    return a['fontSize'] < b['fontSize']
                })
                console.log(priceRecordsSortedByFontSize[0]['text']);
        })();

        Last edited by Tico; 02.03.2020, 02:02.


        • Prof.Mobilux
          Supermoderator
          • 25.08.2015
          • 4871

          #5
          In the meantime we (Christian ;-)) have introduced LoxBerry XL (eXtended Logic), which from my point of view can be used for exactly this purpose. The whole PHP world is available there.



          Last edited by Prof.Mobilux; 13.06.2023, 04:18.
          🇺🇦 Help for the people of Ukraine: https://www.loxforum.com/forum/proje...Cr-die-ukraine


          LoxBerry - Beyond the Limits


          • Tico
            Lox Guru
            • 31.08.2016
            • 1035

            #6
            I find the inputs from nwoguvivian94 and AlanGuzman slightly ChatGPT-inspired: very low contribution counts and somewhat 'vanilla' responses garnered from earlier content in the thread.

            If they are bots, they are very good. If they are human, then I've likely caused offence.

            Curious to see the next instalment of the discussion!



            • Prof.Mobilux
              Prof.Mobilux commented
              Thanks for the hint. ;-)