prevent sponsor with different urls being added as dupes

rax
Posts: 172
Joined: Sat Oct 05, 2013 9:28 pm

Post by rax »

Hello,

I've noticed thousands of duplicate galleries for a specific sponsor I'm importing sets from, even though the "Check DB for Dupes" setting is set to YES.

I believe the problem is that this tube sponsor uses different URLs, both with and without "www", and that's why duplicate content is being added.

1) Can you think of a solution to prevent this?

2) Thousands of duplicate galleries have already gone through rotation. Is there any way to delete the existing duplicates from my database?
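
For reference, one way a URL dupe check can ignore the "www" difference is to normalize each URL before comparing. This is only a minimal sketch in Python, not the script's actual code; the function names and the example.com URLs are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_source_url(url: str) -> str:
    """Return a canonical form of a URL for duplicate comparison."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    # Treat "www.example.com" and "example.com" as the same host.
    if host.startswith("www."):
        host = host[4:]
    # Keep an explicit port if one was given.
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    # Ignore a trailing slash so "/videos/1" and "/videos/1/" match.
    path = parts.path.rstrip("/") or "/"
    # Force the scheme to "http" so http/https variants also compare equal.
    return urlunsplit(("http", host, path, parts.query, ""))


def is_duplicate(url: str, seen: set) -> bool:
    """Check a URL against a set of normalized URLs, recording it if new."""
    key = normalize_source_url(url)
    if key in seen:
        return True
    seen.add(key)
    return False


seen_urls = set()
print(is_duplicate("http://example.com/videos/1/", seen_urls))
print(is_duplicate("http://www.example.com/videos/1/", seen_urls))
```

The second call reports a duplicate because both URLs reduce to the same key. Whether the scheme and trailing slash should also be ignored depends on the sponsor, so treat those two choices as assumptions.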
admin
Site Admin
Posts: 37233
Joined: Wed Sep 10, 2008 11:43 am

Post by admin »

Do you have this option on?

Check for dupe thumbs
The script calculates an MD5 checksum for each created thumb.
If we already have the same MD5, this usually means that the thumb was created
from the same content,
for example, an FHG with the same content but a different design/URL.
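
To illustrate the idea behind this option (not the script's own code), here is a minimal Python sketch of an MD5-based thumb check. The `ThumbIndex` class and its in-memory dictionary are hypothetical; the real script keeps this information in its database.

```python
import hashlib
from typing import Dict, Optional


def thumb_md5(data: bytes) -> str:
    """Return the MD5 checksum of the thumb file contents as hex."""
    return hashlib.md5(data).hexdigest()


class ThumbIndex:
    """Remembers which gallery first produced each thumb checksum."""

    def __init__(self) -> None:
        self._by_md5: Dict[str, int] = {}

    def add(self, gallery_id: int, thumb_bytes: bytes) -> Optional[int]:
        """Register a thumb.

        Returns the id of the gallery that already owns an identical
        thumb, or None if this checksum has not been seen before.
        """
        digest = thumb_md5(thumb_bytes)
        existing = self._by_md5.get(digest)
        if existing is not None:
            return existing
        self._by_md5[digest] = gallery_id
        return None


index = ThumbIndex()
print(index.add(1, b"same image bytes"))   # first time: no owner yet
print(index.add(2, b"same image bytes"))   # identical bytes: owned by gallery 1
```

Note that this only catches thumbs that are byte-for-byte identical. The same frame saved at a different size or quality gives a different MD5.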
Don't forget to run script update

Post by rax »

I found that setting yesterday, before I made this post. It was set to NO and I changed it to YES. I'm not sure yet whether it will prevent the issue. There is also another setting, Check for dupe description, which was already set to YES. I don't use the description field when importing; I only use the title. Maybe it would be helpful to have a setting that checks for duplicate titles as well?

What about my second question? Is there no way to find dupes with the same URL and delete them from the database?

Post by rax »

Looks like the thumb MD5 dupe feature is working.

I checked the gallery_grabber.log and noticed a few issues.

The phrase "Looks like we already have this thumb" appears over 5,000 times. That is a lot, so the original URL dupe check isn't working for this sponsor. The tube is txxx.

Looking closer at the source URLs of already created galleries, they aren't using the original source URL of the tube gallery. Somehow a URL without a real domain is being created as the source URL.
They look like this: "http://1c0ee2da9e6cb4e3743c598fac5ecd9c/", with a different one for each gallery. I checked other tube sponsors, and this is the only one I'm having this problem with.

Here is an excerpt from the error log:

Code:

2017-06-21 01:16:15: Processing http://431cb1abe65a51fe731d331239ce0078/ (425632) (0.54480195045471, 0.0023047924041748)
2017-06-21 01:16:15: Gallery description is empty: Update  with 'admin added'  (0.54583501815796, 0.0010290145874023)
2017-06-21 01:16:15: Content type: 1 (0.54843211174011, 0.0025930404663086)
2017-06-21 01:16:15: Creating thumb  (320x180) (Crop profile: 1)  (0.54894304275513, 0.00050592422485352)
2017-06-21 01:16:15: Downloading img http://txxx.com/get_file/0/1bbedaa6738e6776e285bff5347c27ed/4140000/4140031/screenshots/5.jpg (../tmp/425632/tmp//751924.jpg) (0.54908108711243, 0.00013399124145508)
2017-06-21 01:16:15: Dupe check 70afad2e70b1f92ab66005714eff3e61  (0.86377501487732, 0.31469297409058)
2017-06-21 01:16:15: Grab: http://431cb1abe65a51fe731d331239ce0078/ :  Looks like we already have this thumb  (id: 733099 md5: 70afad2e70b1f92ab66005714eff3e61), skip...  (0.86423993110657, 0.0004580020904541)
2017-06-21 01:16:15: Can not create thumb from http://txxx.com/get_file/0/1bbedaa6738e6776e285bff5347c27ed/4140000/4140031/screenshots/5.jpg () (0.86920309066772, 0.0049607753753662)
2017-06-21 01:16:15: No thumbs were created (0.86929106712341, 8.6069107055664E-5)
2017-06-21 01:16:15: Cleanup tmp folder ../tmp/425632 (0.86935591697693, 6.1988830566406E-5)
2017-06-21 01:16:15: Deleting gallery (0.86954498291016, 0.00018811225891113)
2017-06-21 01:16:15: Delete gallery 425632   from /domain.com/tcms/bin/gallery_grabber.php (0.86957693099976, 2.9087066650391E-5)
2017-06-21 01:16:15: Delete gallery content 425/632 (0.9338071346283, 0.064228057861328)
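
To get a count like this out of gallery_grabber.log without scrolling through it, a small script can match the skip message. A rough Python sketch, assuming the message format matches the excerpt above; the function name is hypothetical and the sample line is copied from the excerpt.

```python
import re
from collections import Counter

# Matches the skip message and captures the existing gallery id and the MD5.
DUPE_RE = re.compile(
    r"Looks like we already have this thumb\s+\(id: (\d+) md5: ([0-9a-f]{32})\)"
)

SAMPLE_LINE = (
    "2017-06-21 01:16:15: Grab: http://431cb1abe65a51fe731d331239ce0078/ :  "
    "Looks like we already have this thumb  "
    "(id: 733099 md5: 70afad2e70b1f92ab66005714eff3e61), skip...  "
    "(0.86423993110657, 0.0004580020904541)"
)


def count_dupe_skips(lines):
    """Count skipped duplicate thumbs, grouped by the existing gallery id."""
    counts = Counter()
    for line in lines:
        match = DUPE_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


counts = count_dupe_skips([SAMPLE_LINE])
print(sum(counts.values()), "skipped thumbs across", len(counts), "existing galleries")
```

To run it on the real log, pass an open file object instead of the sample list, since iterating over a file yields its lines.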

Post by admin »

How do you import galleries? What pattern do you use?

Post by rax »

Sorry, I was going to post a pattern but wasn't sure if you needed it.

example:

Code:

url|title|duration|date|thumb|embed|group|tags

http://txxx.com/videos/4257815/chloe-carter-in-solo-movie-amkingdom/|Chloe Carter in Solo Movie - AmKingdom|479|2017-06-22 10:58:00|http://txxx.com/get_file/0/f3c3c057fbb1c7d3baa376fad4358789/4257000/4257815/screenshots/9.jpg|<iframe width="1280" height="745" src="http://txxx.com/embed/4257815" frameborder="0" allowfullscreen webkitallowfullscreen mozallowfullscreen oallowfullscreen msallowfullscreen></iframe>|Solo Girl,Masturbation,Teens,Small Tits,Tattoos|AmKingdom

Post by admin »

OK, so you are sure there's no such thumb in the DB, and it still says this thumb already exists?

Post by rax »

The thumb does exist, and duplicate galleries are getting added.

There is another problem, which I mentioned above: the source URLs stored for the galleries are not the correct source URLs.

Example:

Code:

Source url:

http://txxx.com/videos/4257815/chloe-carter-in-solo-movie-amkingdom/

Source url being added to galleries after they are imported:

http://1c0ee2da9e6cb4e3743c598fac5ecd9c/

So this is why so many duplicate galleries are being added. The script checks the source URL before import, doesn't find it because it has been changed, and then adds a duplicate gallery.

Let me know if you don't understand and I'll try to explain more.
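
A quick way to see how many galleries are affected would be to check which stored source URLs have a 32-character hex string where the domain should be. A minimal Python sketch; the function name is hypothetical, and how the script ends up producing these hosts is not known from this thread.

```python
import re
from urllib.parse import urlsplit

# Exactly 32 lowercase hex characters, the same shape as an MD5 digest.
HEX32_RE = re.compile(r"[0-9a-f]{32}")


def has_hashed_host(url: str) -> bool:
    """Return True if the URL's host is a bare 32-character hex string."""
    host = urlsplit(url).hostname or ""
    return HEX32_RE.fullmatch(host) is not None


urls = [
    "http://txxx.com/videos/4257815/chloe-carter-in-solo-movie-amkingdom/",
    "http://1c0ee2da9e6cb4e3743c598fac5ecd9c/",
]
for url in urls:
    print(has_hashed_host(url), url)
```

Running this over an export of the stored source URLs would show whether the problem is limited to this one sponsor.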

Post by admin »

I tried adding this line twice and it says "duplicate".
How do you test it?

Post by rax »

How can I delete duplicate galleries that have the same MD5 thumb?
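
In general terms, finding such duplicates means grouping galleries by thumb MD5 and keeping one gallery per group. The sketch below shows that idea in Python with an in-memory SQLite database. The `galleries` table and its `id` and `thumb_md5` columns are assumptions made for the demo; the script's real schema is not shown anywhere in this thread and is probably different.

```python
import sqlite3


def find_duplicate_ids(conn):
    """Return ids of galleries whose thumb MD5 matches an older gallery.

    The lowest id in each MD5 group is kept; every other id is listed.
    """
    rows = conn.execute(
        """
        SELECT g.id
        FROM galleries AS g
        JOIN (
            SELECT thumb_md5, MIN(id) AS keep_id
            FROM galleries
            GROUP BY thumb_md5
            HAVING COUNT(*) > 1
        ) AS d ON d.thumb_md5 = g.thumb_md5
        WHERE g.id <> d.keep_id
        ORDER BY g.id
        """
    ).fetchall()
    return [row[0] for row in rows]


def make_demo_db():
    """Build a throwaway database with made-up rows (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE galleries (id INTEGER PRIMARY KEY, thumb_md5 TEXT)")
    conn.executemany(
        "INSERT INTO galleries VALUES (?, ?)",
        [(1, "aaa"), (2, "bbb"), (3, "aaa"), (4, "aaa"), (5, "ccc")],
    )
    return conn


print(find_duplicate_ids(make_demo_db()))
```

This only produces a list of candidate ids. The log above shows that deleting a gallery also removes its content files, so removing rows by hand would probably leave files behind. Back up the database first and delete through the script's own tools where possible.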