Generic web-scraper for eCommerce websites

  • Tico
    Lox Guru
    • 31.08.2016
    • 1035

    #1

    Generic web-scraper for eCommerce websites

    I'm curious whether anyone has employed a web scraper on the LoxBerry. By this I mean extracting data beyond what a normal Loxone Virtual HTTP Input can handle.

    My goal is to extract the price of my favourite coffee capsules and send an email alert/notification through Loxone if they go on sale. My automated house runs on caffeine before electrons...

    I've analysed the online shop via 'view page source', but the price is completely obfuscated.

    I found the JavaScript snippet at the link below that pulls the price from any given web page. It does this surprisingly successfully.

    https://www.scrapehero.com/how-to-sc...merce-website/

    Installing Puppeteer and running the script in headless mode seems a viable option on the LoxBerry. Unfortunately, it's also beyond my meagre skills.
    I don't speak German. Blame Google Translate if I'm unintelligible.
  • Christian Fenzl
    Living Forum Legend
    • 31.08.2015
    • 11244

    #2
    There is a tutorial for Node.js as well. Node.js is installed on LoxBerry 2.x.
    Extracting a webpage sounds generic, but website providers take several measures to obfuscate their content, beginning with minified JavaScript and ending with special auth mechanisms and session cookies, to force the page to be viewed in a real web browser.

    The so-called “generic” tutorial is in fact a very “specialized” tutorial for collecting the data of that specific example.
    If the page provider's goal isn't obfuscation, you can usually fetch the data directly from the HTML, or from a JSON or XML response requested by the browser.
    Help for the people of Ukraine: https://www.loxforum.com/forum/proje...Cr-die-ukraine


    • Tico
      Lox Guru
      • 31.08.2016
      • 1035

      #3
      So if I don't succeed with the path they propose, I might need to engage ScrapeHero's business for my needs... very clever of them! (sarcasm...)

      I did think it was a bit strange that a data scraping company was so forthcoming with self-defeating guidance.

      Notwithstanding, I did manage to use their script to get my coffee capsule price (with a minor tweak) -


      [Attached image: Web scraper.png]


      But as you say, this might be a very different challenge from a non-browser based intercept.


      • Tico
        Lox Guru
        • 31.08.2016
        • 1035

        #4
        I followed the tutorial for Puppeteer with Node.js. After a few errors and a bit of research, the LoxBerry is successfully downloading hotel names from Booking.com as per their example script.

        https://www.scrapehero.com/how-to-bu...r-and-node-js/

        A problem with their tutorial is that the installation of Puppeteer downloads its own version of Chromium, and that particular version is not compiled for ARM devices.

        The correct way on the LoxBerry is to install the Chromium browser first, then puppeteer-core.

        Code:
        sudo apt-get install chromium-browser chromium-codecs-ffmpeg
        Code:
        npm install puppeteer-core@v1.11.0
        Then change the initial part of the script to use Puppeteer-core and point to the Chromium browser.


        Working hotel scraping script in Puppeteer -
        Code:
        const puppeteer = require('puppeteer-core');

        let bookingUrl = 'https://www.booking.com/searchresults.en-gb.html......truncated';

        (async () => {
            const browser = await puppeteer.launch({
                executablePath: '/usr/bin/chromium-browser',
                headless: true
            });
            const page = await browser.newPage();
            await page.setViewport({ width: 1920, height: 926 });
            await page.goto(bookingUrl);

            // get hotel details
            let hotelData = await page.evaluate(() => {
                let hotels = [];
                // get the hotel elements
                let hotelsElms = document.querySelectorAll('div.sr_property_block[data-hotelid]');
                // get the hotel data
                hotelsElms.forEach((hotelelement) => {
                    let hotelJson = {};
                    try {
                        hotelJson.name = hotelelement.querySelector('span.sr-hotel__name').innerText;
                        hotelJson.reviews = hotelelement.querySelector('span.review-score-widget__subtext').innerText;
                        hotelJson.rating = hotelelement.querySelector('span.review-score-badge').innerText;
                        if (hotelelement.querySelector('strong.price')) {
                            hotelJson.price = hotelelement.querySelector('strong.price').innerText;
                        }
                    } catch (exception) {
                        // skip hotels with missing fields
                    }
                    hotels.push(hotelJson);
                });
                return hotels;
            });

            console.dir(hotelData);
        })();


        But now I'm stuck....

        I can successfully copy/paste the coffee scraping script into the console in Chrome browser and see a result.

        I can successfully run the Loxberry Puppeteer installation with the example hotel scraping script.

        I'm having no luck figuring out where to cut out the hotel-specific part of the script and replace it with the coffee script in Puppeteer.

        I've tried the most obvious location, at line 16 below the comment // get the hotel elements. That returns an error, as pictured below.


        Working coffee scraping script in Chrome developer console -
        Code:
        let elements = [
            ...document.querySelectorAll('body *')
        ]

        function createRecordFromElement(element) {
            const text = element.textContent.trim()
            var record = {}
            const bBox = element.getBoundingClientRect()

            if (text.length <= 30 && !(bBox.x == 0 && bBox.y == 0)) {
                record['fontSize'] = parseInt(getComputedStyle(element)['fontSize'])
            }
            record['y'] = bBox.y
            record['x'] = bBox.x
            record['text'] = text
            return record
        }
        let records = elements.map(createRecordFromElement)

        function canBePrice(record) {
            if (record['y'] > 600 ||
                record['fontSize'] == undefined ||
                !record['text'].match(/(^(US ){0,1}(rs\.|Rs\.|RS\.|\$|₹|INR|USD|CAD|C\$){0,1}(\s){0,1}[\d,]+(\.\d+){0,1}(\s){0,1}(AED){0,1}$)/)
            )
                return false
            else return true
        }

        let possiblePriceRecords = records.filter(canBePrice)
        let priceRecordsSortedByFontSize = possiblePriceRecords.sort(function(a, b) {
            if (a['fontSize'] == b['fontSize']) return a['y'] > b['y']
            return a['fontSize'] < b['fontSize']
        })
        console.log(priceRecordsSortedByFontSize[0]['text']);
        console.log(priceRecordsSortedByFontSize[1]['text']);



        Non-working coffee scraping script in Puppeteer -
        Code:
        const puppeteer = require('puppeteer-core');

        let bookingUrl = 'https://shop.coles.com.au/a/dianella/product/moccona-coffee-capsules-espresso-7';

        (async () => {
            const browser = await puppeteer.launch({
                executablePath: '/usr/bin/chromium-browser',
                headless: true
            });
            const page = await browser.newPage();
            await page.setViewport({ width: 1920, height: 926 });
            await page.goto(bookingUrl);

            // get hotel details
            let hotelData = await page.evaluate(() => {
                let hotels = [];
                // get the hotel elements
                let elements = [
                    ...document.querySelectorAll('body *')
                ]

                function createRecordFromElement(element) {
                    const text = element.textContent.trim()
                    var record = {}
                    const bBox = element.getBoundingClientRect()

                    if (text.length <= 30 && !(bBox.x == 0 && bBox.y == 0)) {
                        record['fontSize'] = parseInt(getComputedStyle(element)['fontSize'])
                    }
                    record['y'] = bBox.y
                    record['x'] = bBox.x
                    record['text'] = text
                    return record
                }
                let records = elements.map(createRecordFromElement)

                function canBePrice(record) {
                    if (record['y'] > 600 ||
                        record['fontSize'] == undefined ||
                        !record['text'].match(/(^(US ){0,1}(rs\.|Rs\.|RS\.|\$|₹|INR|USD|CAD|C\$){0,1}(\s){0,1}[\d,]+(\.\d+){0,1}(\s){0,1}(AED){0,1}$)/)
                    )
                        return false
                    else return true
                }

                let possiblePriceRecords = records.filter(canBePrice)
                let priceRecordsSortedByFontSize = possiblePriceRecords.sort(function(a, b) {
                    if (a['fontSize'] == b['fontSize']) return a['y'] > b['y']
                    return a['fontSize'] < b['fontSize']
                })
                console.log(priceRecordsSortedByFontSize[0]['text']);
        })();

        Last edited by Tico; 02.03.2020, 02:02.


        • Prof.Mobilux
          Supermoderator
          • 25.08.2015
          • 4871

          #5
          In the meantime we (Christian ;-)) have introduced LoxBerry XL (eXtended Logic), which from my point of view can be used for exactly this purpose. The whole PHP world is available there.



          Last edited by Prof.Mobilux; 13.06.2023, 04:18.
          🇺🇦 Help for the people of Ukraine: https://www.loxforum.com/forum/proje...Cr-die-ukraine


          LoxBerry - Beyond the Limits


          • Tico
            Lox Guru
            • 31.08.2016
            • 1035

            #6
            I find the inputs from nwoguvivian94 and AlanGuzman slightly ChatGPT-inspired: very low contribution counts and somewhat 'vanilla' responses garnered from earlier content in the thread.

            If they are bots, they are very good. If they are human, then I've likely caused offence.

            Curious to see the next instalment of the discussion!



            • Prof.Mobilux
              Prof.Mobilux commented
              Thanks for the hint. ;-)