2025-01-21 22:16:47 [scrapy.utils.log] (PID: 335) INFO: Scrapy 2.11.2 started (bot: catalog_extraction)
2025-01-21 22:16:47 [scrapy.utils.log] (PID: 335) INFO: Versions: lxml 5.2.2.0, libxml2 2.12.6, cssselect 1.2.0, parsel 1.9.1, w3lib 2.1.2, Twisted 24.3.0, Python 3.11.11 (main, Dec 4 2024, 20:38:25) [GCC 12.2.0], pyOpenSSL 24.1.0 (OpenSSL 3.2.1 30 Jan 2024), cryptography 42.0.7, Platform Linux-6.4.10-dirty-x86_64-with-glibc2.36
2025-01-21 22:16:47 [wb_mason] (PID: 335) INFO: Starting extraction spider wb_mason...
2025-01-21 22:16:47 [scrapy.addons] (PID: 335) INFO: Enabled addons:
[]
2025-01-21 22:16:47 [scrapy.extensions.telnet] (PID: 335) INFO: Telnet Password: 99ef1aedd4e94354
2025-01-21 22:16:47 [scrapy.middleware] (PID: 335) INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.closespider.CloseSpider',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats',
 'scrapy_playwright.memusage.ScrapyPlaywrightMemoryUsageExtension',
 'spidermon.contrib.scrapy.extensions.Spidermon']
2025-01-21 22:16:47 [scrapy.crawler] (PID: 335) INFO: Overridden settings:
{'BOT_NAME': 'catalog_extraction',
 'CONCURRENT_ITEMS': 250,
 'CONCURRENT_REQUESTS': 24,
 'DOWNLOAD_DELAY': 1.25,
 'FEED_EXPORT_ENCODING': 'utf-8',
 'HTTPPROXY_ENABLED': False,
 'LOG_FILE': '/var/lib/scrapyd/logs/catalog_extraction/wb_mason/64f4c496d84511efba394200a9fe0102.log',
 'LOG_FORMAT': '%(asctime)s [%(name)s] (PID: %(process)d) %(levelname)s: %(message)s',
 'LOG_LEVEL': 'INFO',
 'NEWSPIDER_MODULE': 'catalog_extraction.spiders',
 'REQUEST_FINGERPRINTER_CLASS': 'scrapy_poet.ScrapyPoetRequestFingerprinter',
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'RETRY_TIMES': 5,
 'SPIDER_MODULES': ['catalog_extraction.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor',
 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36'}
2025-01-21 22:16:47 [scrapy-playwright] (PID: 335) WARNING: Connecting to remote browser, ignoring PLAYWRIGHT_LAUNCH_OPTIONS
2025-01-21 22:16:47 [scrapy-playwright] (PID: 335) WARNING: Connecting to remote browser, ignoring PLAYWRIGHT_LAUNCH_OPTIONS
2025-01-21 22:16:47 [scrapy_poet.injection] (PID: 335) INFO: Loading providers: [, , , , , , ]
2025-01-21 22:16:47 [scrapy.middleware] (PID: 335) INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy_poet.InjectionMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy_poet.DownloaderStatsMiddleware']
2025-01-21 22:16:47 [wb_mason] (PID: 335) WARNING: Missing 'PARSING_ERRORS_STORAGE'. Middleware will NOT store the HTML of pages with parsing errors.
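
The overridden-settings dump and the pair of "Connecting to remote browser" warnings above imply a remote-CDP scrapy-playwright setup. A minimal settings sketch that would reproduce them; the handler wiring and module layout are assumptions inferred from this log, not the project's actual configuration, and the CDP credentials are replaced with placeholders:

    # settings.py -- hypothetical reconstruction from the "Overridden settings"
    # dump above. scrapy-playwright ignores PLAYWRIGHT_LAUNCH_OPTIONS whenever
    # PLAYWRIGHT_CDP_URL is set, which is what the two warnings (one per
    # scheme-specific download handler) are reporting.
    BOT_NAME = "catalog_extraction"
    SPIDER_MODULES = ["catalog_extraction.spiders"]

    CONCURRENT_ITEMS = 250
    CONCURRENT_REQUESTS = 24
    DOWNLOAD_DELAY = 1.25
    RETRY_TIMES = 5

    TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
    DOWNLOAD_HANDLERS = {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    }
    PLAYWRIGHT_CDP_URL = "https://<user>:<password>@brd.superproxy.io:9222"  # placeholder credentials
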
2025-01-21 22:16:48 [scrapy.middleware] (PID: 335) INFO: Enabled spider middlewares:
['catalog_extraction.middlewares.ErrorHandlerSpiderMiddleware',
 'catalog_extraction.middlewares.FixtureSavingMiddleware',
 'scrapy_poet.RetryMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2025-01-21 22:16:48 [scrapy.middleware] (PID: 335) INFO: Enabled item pipelines:
['catalog_extraction.pipelines.DuplicatedSKUsFilterPipeline',
 'catalog_extraction.pipelines.DiscontinuedProductsAdjustmentPipeline',
 'catalog_extraction.pipelines.PriceRoundingPipeline',
 'scraping_utils.pipelines.AttachSupplierPipeline',
 'spidermon.contrib.scrapy.pipelines.ItemValidationPipeline']
2025-01-21 22:16:48 [scrapy.core.engine] (PID: 335) INFO: Spider opened
2025-01-21 22:16:48 [scrapy.extensions.closespider] (PID: 335) INFO: Spider will stop when no items are produced after 7200 seconds.
2025-01-21 22:16:48 [scrapy.extensions.logstats] (PID: 335) INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2025-01-21 22:16:48 [scrapy.extensions.telnet] (PID: 335) INFO: Telnet console listening on 127.0.0.1:6023
2025-01-21 22:16:48 [scrapy-playwright] (PID: 335) INFO: Starting download handler
2025-01-21 22:16:48 [scrapy-playwright] (PID: 335) INFO: Starting download handler
2025-01-21 22:16:53 [wb_mason] (PID: 335) INFO: Received 'product_urls': https://www.wbmason.com/ProductDetail.aspx?ItemDesc=Coffee-Mate-Liquid-Coffee-Creamer-Original-038-oz-Single-Serve-Cups-360-Case
2025-01-21 22:16:53 [scrapy-playwright] (PID: 335) INFO: Connecting using CDP: https://brd-customer-hl_13cda1e4-zone-main_scraping_browser:l9p73ctebkrc@brd.superproxy.io:9222
2025-01-21 22:16:53 [scrapy-playwright] (PID: 335) INFO: Connected using CDP: https://brd-customer-hl_13cda1e4-zone-main_scraping_browser:l9p73ctebkrc@brd.superproxy.io:9222
2025-01-21 22:17:48 [scrapy.extensions.logstats] (PID: 335) INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
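
The wait that times out below was most likely registered as a scrapy-playwright PageMethod on the product-detail request. A sketch under that assumption; only the selector and the 30 s default timeout come from this log, the spider internals and names are illustrative:

    import scrapy
    from scrapy_playwright.page import PageMethod

    class WbMasonSpider(scrapy.Spider):
        # Assumed shape of the real wb_mason spider; in the actual project the
        # URL arrives via the 'product_urls' argument and extraction is handled
        # by scrapy-poet page objects.
        name = "wb_mason"

        def start_requests(self):
            url = (
                "https://www.wbmason.com/ProductDetail.aspx?ItemDesc="
                "Coffee-Mate-Liquid-Coffee-Creamer-Original-038-oz-Single-Serve-Cups-360-Case"
            )
            yield scrapy.Request(
                url,
                meta={
                    "playwright": True,
                    "playwright_page_methods": [
                        # Playwright waits up to its default 30000 ms for the
                        # product row to become visible, then raises TimeoutError.
                        PageMethod("wait_for_selector", ".row.pos-relative"),
                    ],
                },
                callback=self.parse_product,  # assumed name
            )

        def parse_product(self, response):
            ...
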
2025-01-21 22:18:25 [scrapy-playwright] (PID: 335) WARNING: Closing page due to failed request: <GET https://www.wbmason.com/ProductDetail.aspx?ItemDesc=Coffee-Mate-Liquid-Coffee-Creamer-Original-038-oz-Single-Serve-Cups-360-Case> exc_type=<class 'playwright._impl._errors.TimeoutError'> exc_msg=Page.wait_for_selector: Timeout 30000ms exceeded.
Call log:
waiting for locator(".row.pos-relative") to be visible
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/scrapy_playwright/handler.py", line 436, in _download_request_with_retry
    return await self._download_request_with_page(request, page, spider)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy_playwright/handler.py", line 484, in _download_request_with_page
    await self._apply_page_methods(page, request, spider)
  File "/usr/local/lib/python3.11/site-packages/scrapy_playwright/handler.py", line 630, in _apply_page_methods
    pm.result = await _maybe_await(method(*pm.args, **pm.kwargs))
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/scrapy_playwright/_utils.py", line 21, in _maybe_await
    return await obj
           ^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/playwright/async_api/_generated.py", line 7831, in wait_for_selector
    await self._impl_obj.wait_for_selector(
  File "/usr/local/lib/python3.11/site-packages/playwright/_impl/_page.py", line 392, in wait_for_selector
    return await self._main_frame.wait_for_selector(**locals_to_params(locals()))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/playwright/_impl/_frame.py", line 323, in wait_for_selector
    await self._channel.send("waitForSelector", locals_to_params(locals()))
  File "/usr/local/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 59, in send
    return await self._connection.wrap_api_call(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 514, in wrap_api_call
    raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
playwright._impl._errors.TimeoutError: Page.wait_for_selector: Timeout 30000ms exceeded.
Call log:
waiting for locator(".row.pos-relative") to be visible
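
If the page is simply slow to render behind the remote browser, the same wait can be given an explicit budget: Playwright's wait_for_selector accepts a timeout keyword in milliseconds. A one-line variant of the page method sketched earlier (the 60 s value is arbitrary):

    from scrapy_playwright.page import PageMethod

    # Same wait, but with a 60 s budget instead of Playwright's 30 s default;
    # "state" defaults to "visible", matching the call log above.
    PageMethod("wait_for_selector", ".row.pos-relative", timeout=60_000)
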
2025-01-21 22:18:26 [scrapy.core.scraper] (PID: 335) ERROR: Error downloading <GET https://www.wbmason.com/ProductDetail.aspx?ItemDesc=Coffee-Mate-Liquid-Coffee-Creamer-Original-038-oz-Single-Serve-Cups-360-Case>
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/twisted/internet/defer.py", line 1999, in _inlineCallbacks
    result = context.run(
  File "/usr/local/lib/python3.11/site-packages/twisted/python/failure.py", line 519, in throwExceptionIntoGenerator
    return g.throw(self.value.with_traceback(self.tb))
  File "/usr/local/lib/python3.11/site-packages/scrapy/core/downloader/middleware.py", line 54, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/usr/local/lib/python3.11/site-packages/twisted/internet/defer.py", line 1251, in adapt
    extracted: _SelfResultT | Failure = result.result()
  File "/usr/local/lib/python3.11/site-packages/scrapy_playwright/handler.py", line 383, in _download_request
    return await self._download_request_with_retry(request=request, spider=spider)
  File "/usr/local/lib/python3.11/site-packages/scrapy_playwright/handler.py", line 436, in _download_request_with_retry
    return await self._download_request_with_page(request, page, spider)
  File "/usr/local/lib/python3.11/site-packages/scrapy_playwright/handler.py", line 484, in _download_request_with_page
    await self._apply_page_methods(page, request, spider)
  File "/usr/local/lib/python3.11/site-packages/scrapy_playwright/handler.py", line 630, in _apply_page_methods
    pm.result = await _maybe_await(method(*pm.args, **pm.kwargs))
  File "/usr/local/lib/python3.11/site-packages/scrapy_playwright/_utils.py", line 21, in _maybe_await
    return await obj
  File "/usr/local/lib/python3.11/site-packages/playwright/async_api/_generated.py", line 7831, in wait_for_selector
    await self._impl_obj.wait_for_selector(
  File "/usr/local/lib/python3.11/site-packages/playwright/_impl/_page.py", line 392, in wait_for_selector
    return await self._main_frame.wait_for_selector(**locals_to_params(locals()))
  File "/usr/local/lib/python3.11/site-packages/playwright/_impl/_frame.py", line 323, in wait_for_selector
    await self._channel.send("waitForSelector", locals_to_params(locals()))
  File "/usr/local/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 59, in send
    return await self._connection.wrap_api_call(
  File "/usr/local/lib/python3.11/site-packages/playwright/_impl/_connection.py", line 514, in wrap_api_call
    raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
playwright._impl._errors.TimeoutError: Page.wait_for_selector: Timeout 30000ms exceeded.
Call log:
waiting for locator(".row.pos-relative") to be visible
2025-01-21 22:18:26 [scrapy.core.engine] (PID: 335) INFO: Closing spider (finished)
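
Note that despite RETRY_TIMES being 5, the stats dump at the end shows a single request and no retry counters: RetryMiddleware only retries the exception types listed in the RETRY_EXCEPTIONS setting (Scrapy >= 2.10), and Playwright's TimeoutError is not among the defaults. One possible opt-in, offered as an untested sketch rather than a verified fix for this job:

    # settings.py (addition) -- hypothetical: make Playwright timeouts
    # retryable like ordinary network errors.
    from scrapy.settings.default_settings import RETRY_EXCEPTIONS

    RETRY_EXCEPTIONS = list(RETRY_EXCEPTIONS) + [
        # Public alias of the playwright._impl._errors.TimeoutError in the traceback.
        "playwright.async_api.TimeoutError",
    ]
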
2025-01-21 22:18:26 [wb_mason] (PID: 335) INFO: [Spidermon] ------------------------------ MONITORS ------------------------------
2025-01-21 22:18:26 [wb_mason] (PID: 335) INFO: [Spidermon] Extracted Items Monitor/test_stat_monitor... FAIL
2025-01-21 22:18:26 [wb_mason] (PID: 335) INFO: [Spidermon] Item Validation Monitor/test_stat_monitor... SKIPPED (Unable to find 'spidermon/validation/fields/errors' in job stats.)
2025-01-21 22:18:26 [wb_mason] (PID: 335) INFO: [Spidermon] Error Count Monitor/test_stat_monitor... FAIL
2025-01-21 22:18:26 [wb_mason] (PID: 335) INFO: [Spidermon] Warning Count Monitor/test_stat_monitor... FAIL
2025-01-21 22:18:26 [wb_mason] (PID: 335) INFO: [Spidermon] Finish Reason Monitor/Should have the expected finished reason(s)... OK
2025-01-21 22:18:26 [wb_mason] (PID: 335) INFO: [Spidermon] Unwanted HTTP codes monitor/Should not hit the limit of unwanted http status... OK
2025-01-21 22:18:26 [wb_mason] (PID: 335) INFO: [Spidermon] Field Coverage Monitor/test_check_if_field_coverage_rules_are_met... FAIL
2025-01-21 22:18:26 [wb_mason] (PID: 335) INFO: [Spidermon] Retry Count monitor/Should not hit the limit of requests that reached the maximum retry amount... OK
2025-01-21 22:18:26 [wb_mason] (PID: 335) INFO: [Spidermon] Downloader Exceptions monitor/test_stat_monitor... OK
2025-01-21 22:18:26 [wb_mason] (PID: 335) INFO: [Spidermon] Successful Requests monitor/Should have at least the minimum number of successful requests... OK
2025-01-21 22:18:26 [wb_mason] (PID: 335) INFO: [Spidermon] Total Requests monitor/Should not hit the total limit of requests... OK
2025-01-21 22:18:26 [wb_mason] (PID: 335) INFO: [Spidermon] ----------------------------------------------------------------------
2025-01-21 22:18:26 [wb_mason] (PID: 335) ERROR: [Spidermon] ======================================================================
FAIL: Extracted Items Monitor/test_stat_monitor
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/spidermon/contrib/scrapy/monitors/base.py", line 177, in test_stat_monitor
    self.fail(message)
AssertionError: Unable to find 'item_scraped_count' in job stats.
2025-01-21 22:18:26 [wb_mason] (PID: 335) ERROR: [Spidermon] ======================================================================
FAIL: Error Count Monitor/test_stat_monitor
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/spidermon/contrib/scrapy/monitors/base.py", line 184, in test_stat_monitor
    assertion_method(
AssertionError: Expecting 'log_count/ERROR' to be '<=' to '0.0'. Current value: '1'
2025-01-21 22:18:26 [wb_mason] (PID: 335) ERROR: [Spidermon] ======================================================================
FAIL: Warning Count Monitor/test_stat_monitor
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/spidermon/contrib/scrapy/monitors/base.py", line 184, in test_stat_monitor
    assertion_method(
AssertionError: Expecting 'log_count/WARNING' to be '<=' to '1.0'. Current value: '4'
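
The three failures above are spidermon's built-in stat monitors doing their job: no item_scraped_count stat exists because nothing was scraped, and the error/warning budgets were exceeded. Settings along these lines would produce these thresholds; the names follow spidermon's documented monitor settings, but the project's actual values are an assumption:

    # settings.py (additions) -- hypothetical Spidermon thresholds matching
    # the assertions above.
    SPIDERMON_ENABLED = True
    SPIDERMON_MIN_ITEMS = 1                           # Extracted Items Monitor
    SPIDERMON_MAX_ERRORS = 0                          # 'log_count/ERROR' <= 0.0
    SPIDERMON_MAX_WARNINGS = 1                        # 'log_count/WARNING' <= 1.0
    SPIDERMON_EXPECTED_FINISH_REASONS = ["finished"]  # Finish Reason Monitor (OK above)
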
2025-01-21 22:18:26 [wb_mason] (PID: 335) ERROR: [Spidermon] ======================================================================
FAIL: Field Coverage Monitor/test_check_if_field_coverage_rules_are_met
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/spidermon/contrib/scrapy/monitors/monitors.py", line 476, in test_check_if_field_coverage_rules_are_met
    self.assertTrue(len(failures) == 0, msg=msg)
AssertionError: The following items did not meet field coverage rules:
dict/inStock (expected 1.0, got 0)
dict/name (expected 1.0, got 0)
dict/prices (expected 1.0, got 0)
dict/productStatus (expected 1.0, got 0)
dict/supplier (expected 1.0, got 0)
dict/supplierSku (expected 1.0, got 0)
dict/url (expected 1.0, got 0)
2025-01-21 22:18:26 [wb_mason] (PID: 335) INFO: [Spidermon] 11 monitors in 0.003s
2025-01-21 22:18:26 [wb_mason] (PID: 335) INFO: [Spidermon] FAILED (failures=4, skipped=1)
2025-01-21 22:18:26 [wb_mason] (PID: 335) INFO: [Spidermon] -------------------------- FINISHED ACTIONS --------------------------
2025-01-21 22:18:26 [wb_mason] (PID: 335) INFO: [Spidermon] ----------------------------------------------------------------------
2025-01-21 22:18:26 [wb_mason] (PID: 335) INFO: [Spidermon] 0 actions in 0.000s
2025-01-21 22:18:26 [wb_mason] (PID: 335) INFO: [Spidermon] OK
2025-01-21 22:18:26 [wb_mason] (PID: 335) INFO: [Spidermon] --------------------------- PASSED ACTIONS ---------------------------
2025-01-21 22:18:26 [wb_mason] (PID: 335) INFO: [Spidermon] ----------------------------------------------------------------------
2025-01-21 22:18:26 [wb_mason] (PID: 335) INFO: [Spidermon] 0 actions in 0.000s
2025-01-21 22:18:26 [wb_mason] (PID: 335) INFO: [Spidermon] OK
2025-01-21 22:18:26 [wb_mason] (PID: 335) INFO: [Spidermon] --------------------------- FAILED ACTIONS ---------------------------
2025-01-21 22:18:26 [spidermon.contrib.actions.slack] (PID: 335) INFO: :skull: `wb_mason` *spider finished with errors!* _(errors=4)_
2025-01-21 22:18:26 [spidermon.contrib.actions.slack] (PID: 335) INFO: [
  {
    "text": "• _Extracted Items Monitor/test_stat_monitor_: Unable to find 'item_scraped_count' in job stats.\n• _Error Count Monitor/test_stat_monitor_: Expecting 'log_count/ERROR' to be '<=' to '0.0'. Current value: '1'\n• _Warning Count Monitor/test_stat_monitor_: Expecting 'log_count/WARNING' to be '<=' to '1.0'. Current value: '4'\n• _Field Coverage Monitor/test_check_if_field_coverage_rules_are_met_: The following items did not meet field coverage rules: dict/inStock (expected 1.0, got 0) dict/name (expected 1.0, got 0) dict/prices (expected 1.0, got 0) dict/productStatus (expected 1.0, got 0) dict/supplier (expected 1.0, got 0) dict/supplierSku (expected 1.0, got 0) dict/url (expected 1.0, got 0)\n",
    "color": "danger",
    "mrkdwn_in": ["text", "pretext"]
  },
]
2025-01-21 22:18:26 [wb_mason] (PID: 335) INFO: [Spidermon] CustomTemplateSendSlackMessageSpiderFinished... OK
2025-01-21 22:18:26 [wb_mason] (PID: 335) INFO: [Spidermon] ----------------------------------------------------------------------
2025-01-21 22:18:26 [wb_mason] (PID: 335) INFO: [Spidermon] 1 action in 0.013s
2025-01-21 22:18:26 [wb_mason] (PID: 335) INFO: [Spidermon] OK
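
The field list in the coverage failure maps directly onto spidermon's SPIDERMON_FIELD_COVERAGE_RULES setting; the "dict/" prefix is how spidermon addresses fields of plain-dict items. The rules this job appears to enforce, reconstructed from the "expected 1.0" messages above (treat the exact dict as inferred, not copied from the project):

    # settings.py (additions) -- coverage rules inferred from the failure
    # message: every field is required on 100% of scraped items.
    SPIDERMON_ADD_FIELD_COVERAGE = True  # needed for coverage stats to be collected
    SPIDERMON_FIELD_COVERAGE_RULES = {
        "dict/inStock": 1.0,
        "dict/name": 1.0,
        "dict/prices": 1.0,
        "dict/productStatus": 1.0,
        "dict/supplier": 1.0,
        "dict/supplierSku": 1.0,
        "dict/url": 1.0,
    }
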
2025-01-21 22:18:26 [scrapy.extensions.feedexport] (PID: 335) INFO: No data to insert into BigQuery - closing feed storage
2025-01-21 22:18:26 [scrapy.extensions.feedexport] (PID: 335) INFO: Stored bq feed (0 items) in: bq://response-elt.dev_scrapers.catalog_item_scrape/batch:1
2025-01-21 22:18:26 [scrapy.statscollectors] (PID: 335) INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
 'downloader/exception_type_count/playwright._impl._errors.TimeoutError': 1,
 'downloader/request_bytes': 386,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'elapsed_time_seconds': 98.057303,
 'feedexport/success_count/BigQueryFeedStorage': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2025, 1, 21, 22, 18, 26, 142474, tzinfo=datetime.timezone.utc),
 'log_count/ERROR': 5,
 'log_count/INFO': 50,
 'log_count/WARNING': 4,
 'memusage/max': 255541248,
 'memusage/startup': 127897600,
 'playwright/browser_count': 1,
 'playwright/context_count': 1,
 'playwright/context_count/max_concurrent': 1,
 'playwright/context_count/persistent/False': 1,
 'playwright/context_count/remote/True': 1,
 'playwright/page_count': 1,
 'playwright/page_count/closed': 1,
 'playwright/page_count/max_concurrent': 1,
 'playwright/request_count': 59,
 'playwright/request_count/aborted': 16,
 'playwright/request_count/method/GET': 55,
 'playwright/request_count/method/POST': 4,
 'playwright/request_count/navigation': 6,
 'playwright/request_count/resource_type/document': 6,
 'playwright/request_count/resource_type/fetch': 1,
 'playwright/request_count/resource_type/font': 2,
 'playwright/request_count/resource_type/image': 4,
 'playwright/request_count/resource_type/ping': 2,
 'playwright/request_count/resource_type/script': 14,
 'playwright/request_count/resource_type/stylesheet': 1,
 'playwright/request_count/resource_type/xhr': 29,
 'playwright/response_count': 18,
 'playwright/response_count/method/GET': 17,
 'playwright/response_count/method/POST': 1,
 'playwright/response_count/resource_type/document': 3,
 'playwright/response_count/resource_type/font': 1,
 'playwright/response_count/resource_type/image': 1,
 'playwright/response_count/resource_type/script': 8,
 'playwright/response_count/resource_type/stylesheet': 1,
 'playwright/response_count/resource_type/xhr': 4,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spidermon/validation/validators': 1,
 'spidermon/validation/validators/item/jsonschema': True,
 'start_requests/product_urls': 1,
 'start_time': datetime.datetime(2025, 1, 21, 22, 16, 48, 85171, tzinfo=datetime.timezone.utc)}
2025-01-21 22:18:26 [scrapy.core.engine] (PID: 335) INFO: Spider closed (finished)
2025-01-21 22:18:26 [scrapy-playwright] (PID: 335) INFO: Closing download handler
2025-01-21 22:18:26 [scrapy-playwright] (PID: 335) INFO: Closing download handler
2025-01-21 22:18:26 [scrapy-playwright] (PID: 335) INFO: Closing browser
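
For triaging runs like this one automatically, everything in the stats dump is available on the crawler's stats collector at close time. A minimal, hypothetical extension that flags this log's exact failure mode (finish_reason 'finished' with zero items after a Playwright timeout); names and placement are illustrative:

    # extensions.py -- illustrative sketch, not part of the real project.
    from scrapy import signals

    class ZeroItemAlert:
        @classmethod
        def from_crawler(cls, crawler):
            ext = cls()
            ext.stats = crawler.stats
            crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
            return ext

        def spider_closed(self, spider, reason):
            scraped = self.stats.get_value("item_scraped_count", 0)
            timeouts = self.stats.get_value(
                "downloader/exception_type_count/playwright._impl._errors.TimeoutError", 0
            )
            if reason == "finished" and scraped == 0:
                spider.logger.error(
                    "Finished with 0 items (%d Playwright timeout(s)); check the wait selector.",
                    timeouts,
                )
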