I tried some website scraping. I successfully scraped the data from the current database table. But I want to insert "new_table" only if no record exists in "current table"
My code is (pipeline)
table = 'products' table2 = 'new_products'` def save(self, row): cursor = self.cnx.cursor() cursor.execute("SELECT DISTINCT product_id FROM products;") old_ids = [row[0] for row in cursor.fetchall()] create_query = ("INSERT INTO " + self.table + "(rowid, date, listing_id, product_id, product_name, price, url) " "VALUES (%(rowid)s, %(date)s, %(listing_id)s, %(product_id)s, %(product_name)s, %(price)s, %(url)s)") cursor.execute(create_query, row) lastRecordId = cursor.lastrowid self.cnx.commit() cursor.close() print("Item saved with ID: {}" . format(lastRecordId)) if not product_id in old_ids: create_query = ("INSERT INTO " + self.table2 + "(rowid, date, listing_id, product_id, product_name, price, url) " "VALUES (%(rowid)s, %(date)s, %(listing_id)s, %(product_id)s, %(product_name)s, %(price)s, %(url)s)")
It doesn't work properly and gives an error.
2022-05-06 12:26:57 [scrapy.core.scraper] ERROR: Error processing {'date': '2022-05-06 12:26:57.575507', 'listing_id': '0190199600119', 'price': '4199.00', 'product_id': '1209298', 'product_name': 'APPLE 11" Magic Türkçe Q Klavye Siyah', 'rowid': 456274953331128512, 'url': 'https://www.mediamarkt.com.tr/tr/product/APPLE%2011%22%20Magic%20T%C3%BCrk%C3%A7e%20Q%20Klavye%20Siyah-1209298.html'} Traceback (most recent call last): File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks current.result = callback(current.result, *args, **kw) File "/usr/local/lib/python3.8/dist-packages/scrapy/utils/defer.py", line 162, in f return deferred_from_coro(coro_f(*coro_args, **coro_kwargs)) File "/root/teknosa/teknosa/pipelines.py", line 28, in process_item self.save(dict(item)) File "/root/teknosa/teknosa/pipelines.py", line 62, in save if not product_id in old_ids: NameError: name 'product_id' is not defined Saving item into db ...
I have unique product_id.
If there is no product_id in the current table, insert this product_id into "new_products"
How to make this?
Thanks.
Last edit: I'm getting this error.
2022-05-07 18:17:11 [scrapy.core.scraper] ERROR: Error processing {'date': '2022-05-07 18:17:11.902622', 'listing_id': '8713439219357', 'price': '99.00', 'product_id': '1175529', 'product_name': 'TRUST 21935 NANGA USB 3.1 Kart Okuyucu', 'rowid': -411152717288573423, 'url': 'https://www.mediamarkt.com.tr/tr/product/TRUST%2021935%20NANGA%20USB%203.1%20Kart%20Okuyucu-1175529.html'} Traceback (most recent call last): File "/usr/local/lib/python3.8/dist-packages/mysql/connector/connection_cext.py", line 523, in cmd_query self._cmysql.query(query, _mysql_connector.MySQLInterfaceError: Duplicate entry '-411152717288573423' for key 'products.rowid' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks current.result = callback(current.result, *args, **kw) File "/usr/local/lib/python3.8/dist-packages/scrapy/utils/defer.py", line 162, in f return deferred_from_coro(coro_f(*coro_args, **coro_kwargs)) File "/root/teknosa/teknosa/pipelines.py", line 28, in process_item self.save(dict(item)) File "/root/teknosa/teknosa/pipelines.py", line 69, in save cursor.execute(create_query, row) File "/usr/local/lib/python3.8/dist-packages/mysql/connector/cursor_cext.py", line 269, in execute result = self._cnx.cmd_query(stmt, raw=self._raw, File "/usr/local/lib/python3.8/dist-packages/mysql/connector/connection_cext.py", line 528, in cmd_query raise errors.get_mysql_exception(exc.errno, msg=exc.msg, mysql.connector.errors.IntegrityError: 1062 (23000): Duplicate entry '-411152717288573423' for key 'products.rowid'
P粉2783794952024-03-30 00:30:37
If you just want to insert if it doesn't exist, there is no need to do what you are doing. No need to select them all and see if the one you're looking for is there.
What you need is to create a unique index for produc_id in table 2
Then change the code to:
table = 'products' table2 = 'new_products'` def save(self, row): create_query = ("INSERT INTO " + self.table + "(rowid, date, listing_id, product_id, product_name, price, url) " "VALUES (%(rowid)s, %(date)s, %(listing_id)s, %(product_id)s, %(product_name)s, %(price)s, %(url)s)") cursor.execute(create_query, row) lastRecordId = cursor.lastrowid self.cnx.commit() print("Item saved with ID: {}" . format(lastRecordId)) create_query = ("INSERT INTO " + self.table2 + "(rowid, date, listing_id, product_id, product_name, price, url) " "VALUES (%(rowid)s, %(date)s, %(listing_id)s, %(product_id)s, %(product_name)s, %(price)s, %(url)s) ON DUPLICATE KEY UPDATE product_id=product_id") cursor.execute(create_query, row) self.cnx.commit()
If you use ON DUPLICATE KEY, when it finds a duplicate row (already existing product_id), the system will try to update the product_id to the same product_id, so it will not take effect.
These commits can be deleted if autocommit=True is set.
edit
If, as you said in your comment, you need to insert into the new table only if it does not already exist in the table, you can change your code like this:
You need to change the variable name in the row old_ids = [row[0] for row incursor.fetchall()] because you are changing the value of the row
parameter
2. Your problem lies in the if statement. The product_id variable does not exist and needs to be modified
table = 'products' table2 = 'new_products'` def save(self, row): cursor = self.cnx.cursor() cursor.execute("SELECT DISTINCT product_id FROM products;") old_ids = [element[0] for element in cursor.fetchall()] create_query = ("INSERT INTO " + self.table + "(rowid, date, listing_id, product_id, product_name, price, url) " "VALUES (%(rowid)s, %(date)s, %(listing_id)s, %(product_id)s, %(product_name)s, %(price)s, %(url)s)") cursor.execute(create_query, row) lastRecordId = cursor.lastrowid self.cnx.commit() cursor.close() print("Item saved with ID: {}" . format(lastRecordId)) if not row['product_id'] in old_ids: create_query = ("INSERT INTO " + self.table2 + "(rowid, date, listing_id, product_id, product_name, price, url) " "VALUES (%(rowid)s, %(date)s, %(listing_id)s, %(product_id)s, %(product_name)s, %(price)s, %(url)s)")