tommyinorl

Member
Jul 13, 2006
19
0
151
file_get_contents is a simple one-liner that works for most sites I need to scrape (open, public data, so no firewalls of any kind). However, for some sites I get:

"Failed to open stream: Connection timed out"

So I tested the same simple script on my non-cPanel/WHM server, with no problems. Both are running the same PHP versions with the same configs. I tested with several versions trying to pinpoint the problem.

What would cause this?

Exact same thing happens with curl

thanks :)
 

cPRex

Jurassic Moderator
Staff member
Oct 19, 2014
17,470
2,843
363
cPanel Access Level
Root Administrator
Hey there! Can you work through the things mentioned here to see if you could be experiencing a similar issue?

 

tommyinorl

Yes, thanks. I already experimented with SSL; I even downloaded the certificates and put them in a directory. It may have something to do with IPv6.
 

rbairwell

Well-Known Member
May 28, 2022
129
59
28
Mansfield, Nottingham, UK
cPanel Access Level
Root Administrator
tommyinorl said: "Yes of course, as I said it works with most URLs"
Can you provide an example of a URL it works on and a URL it fails on?

tommyinorl said: "Exact same thing happens with curl"
I would suspect this is NOT a PHP configuration issue, nor a problem on your server. It is most likely that the external site you are connecting to is denying access to your server's IP address (or perhaps just its user agent). This is sometimes done on sites which are meant to be "end customer facing only", where the web developer puts blocks in place to stop datacentre IP addresses from connecting for security reasons.

It is also possible that, as you speculated, this could be down to an IPv6 issue. I assume you can make requests such as:
curl -4 icanhazip.com
curl -6 icanhazip.com
without problems and they return your public IPv4 and IPv6 addresses?
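To go a step further than icanhazip, the same two curl calls can be wrapped in a small shell function that tries any URL over IPv4 and IPv6 separately and names the failure. This is only a sketch: `explain_curl_exit` and `probe_url` are made-up helper names, and the 10-second connect timeout and 30-second overall limit are arbitrary choices. The exit codes it translates (6, 7, 28, 35, 60) are the ones documented in the curl man page.

```shell
# Translate a curl exit status into a short description.
explain_curl_exit() {
  case "$1" in
    0)  echo "ok" ;;
    6)  echo "could not resolve host" ;;
    7)  echo "failed to connect" ;;
    28) echo "timed out" ;;
    35) echo "TLS handshake failed" ;;
    60) echo "certificate could not be verified" ;;
    *)  echo "curl exit code $1" ;;
  esac
}

# Try one URL over IPv4, then over IPv6, discarding the response body.
probe_url() {
  url="$1"
  for family in 4 6; do
    status=0
    curl "-$family" --silent --output /dev/null \
         --connect-timeout 10 --max-time 30 "$url" || status=$?
    echo "IPv$family: $(explain_curl_exit "$status")"
  done
}

# Example:
# probe_url https://example.com/
```

If a host publishes no IPv6 address at all, the IPv6 line will typically report a resolution failure rather than a timeout, which already tells you IPv6 is not the path being used.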
 

tommyinorl

Yes thanks for the help, this is one of hundreds that have this behavior

I use pdfparser to decode after the scrape -- it pulls up instantly on my NON cPanel/WHM server, but I get "Connection timed out" on the cPanel one with any site and any PHP config, from what I can tell.

echo $html = file_get_contents("https://fredweb.co.frederick.va.us/sheriff/pdfs/arrests.pdf");

As I said, curl and pdfparser all give the same response.

thanks again for the help


Here is an example of a URL that works on both my servers -- it loads as expected on every site, every config


I don't think they are blocking me: this happens on the first visit. Blocking after repeat visits I could understand, but none of these sites allow even one, and it works on the non-cPanel server.
 

tommyinorl

This is for my project police ping com if you want to check it out. The PDF scrapes are messy but if it was easy everyone would do it :)
 

rbairwell

They are doing some sort of IP blocking.

I'm not able to access the site from my UK IP address either, whether via Chrome, curl or wget (and that's not just the PDF file, but also the root of the domain, IIS Windows Server: curl+wget fail with timeouts). https://downforeveryoneorjustme.com/fredweb.co.frederick.va.us?proto=https reports it down, and so does https://isitup.org/fredweb.co.frederick.va.us, but https://validator.w3.org/check?uri=...ically)&doctype=Inline&group=0&ss=1&verbose=1 and https://wave.webaim.org/report#/https://fredweb.co.frederick.va.us say they can access it. https://tools.pingdom.com can reach it from Virginia but not from Frankfurt, DE, and GTMetrix fails when testing from Vancouver, CA.

The only thing I can suggest to work around it is to either contact them asking for your server IP/ranges to be allowed, or use a server or proxy in a different data-centre. I've done the exact same thing as you before (PDF scraping), and proxying and using VPNs was the solution I settled on. I would suggest getting cheap Amazon EC2 instances in various regions and finding out which ones can access the site (if any).
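A rough way to find out which vantage points can reach the site is to loop over candidate proxies with curl's `--proxy` option. The sketch below assumes you already have some proxies or tunnels to try; `first_working_proxy` is a made-up name, and the proxy addresses in the example are placeholders, not real endpoints.

```shell
# Print the first proxy in the list through which the URL can be fetched.
# Returns non-zero if none of them work.
first_working_proxy() {
  url="$1"
  shift
  for proxy in "$@"; do
    # --fail makes curl return an error on HTTP 4xx/5xx responses too.
    if curl --silent --fail --output /dev/null \
            --connect-timeout 10 --max-time 30 \
            --proxy "$proxy" "$url"; then
      echo "$proxy"
      return 0
    fi
  done
  return 1
}

# Example (placeholder proxies):
# first_working_proxy https://example.com/file.pdf \
#     http://proxy-us-east.example:3128 socks5h://127.0.0.1:1080
```

Once a proxy that works is known, the same proxy URL can be handed to PHP's curl extension through `CURLOPT_PROXY`.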
 

tommyinorl

The more I investigate, the more I think it has to do with IPv6, which I have no clue about -- I see some posts that say I can just disable it, but I don't want to do that, do I?
Am I able to toggle that to test real quick?
 

tommyinorl

It's not an IP block, because it works on the Fastpanel server and I am just visiting these for the first time ever; it would take several visits to count as abuse.


You can't reason with Government I may as well make a call to this wall and talk to it.
 

rbairwell

tommyinorl said: "The more I investigate, the more I think it has to do with IPv6, which I have no clue about -- I see some posts that say I can just disable it, but I don't want to do that, do I? Am I able to toggle that to test real quick?"
If you try:
curl -4 https://fredweb.co.frederick.va.us/sheriff/pdfs/arrests.pdf
curl -6 https://fredweb.co.frederick.va.us/sheriff/pdfs/arrests.pdf
it'll force curl to fetch it via IPv4 and IPv6 respectively, but it looks like they are IP range blocking. The fredweb.co.frederick.va.us hostname doesn't have an AAAA record, which is needed for IPv6, so it is unlikely to be that (DNS Lookup - Check DNS All Records and Network Tools: DNS,IP,Email).
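The AAAA check can also be done from the shell instead of a web tool. A minimal sketch, assuming `dig` is installed (it usually comes from a bind-utils or dnsutils package); `only_ipv6` and `has_aaaa` are made-up helper names.

```shell
# Keep only lines that look like IPv6 addresses (hex digits with at least
# one colon). dig +short can also print CNAME targets or error text,
# which this filter drops.
only_ipv6() {
  grep -E '^[0-9A-Fa-f:]*:[0-9A-Fa-f:.]*$' || true
}

# Succeeds if the host publishes at least one AAAA record.
has_aaaa() {
  [ -n "$(dig +short AAAA "$1" | only_ipv6)" ]
}

# Example:
# has_aaaa fredweb.co.frederick.va.us && echo "has IPv6" || echo "IPv4 only"
```

If this reports "IPv4 only" for the failing hosts, the server will be connecting to them over IPv4 regardless of how IPv6 is configured locally.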
 

rbairwell

tommyinorl said: "It's not an IP block, because it works on the Fastpanel server and I am just visiting these for the first time ever; it would take several visits to count as abuse."
It's my first time accessing them as well and I'm blocked. Sites tend to block IP addresses they don't see the need to give access to, for "security reasons", even if no abuse has occurred. The fact your Fastpanel server has access just means its IP address isn't in one of the many blocked ranges they have in place (let me hazard a guess: your cPanel and Fastpanel servers aren't in the same net-range/netblock/datacenter).
 

tommyinorl

I tried proxies but didn't spend too much time on them. The server has several IPs in its pool; however, they are all in the same range. The server has been online for a while and its IPs have been on blacklists before.
Still, these are little towns for the most part, and the PDF to me shows an outdated system. I don't see them doing all that. Plus it's illegal to block public data: they either have to block everyone or nobody or they get sued, at least in the USA.
 

rbairwell

tommyinorl said: "Still these are little towns for the most part, the PDF to me shows an outdated system I don't see them doing all that"
It might be their upstream provider, Comcast Business, putting the IP blocks in place. The PDF generation side of an organisation is usually quite separate from the web site maintenance (and especially separate from the infrastructure side of things), so the fact the PDF looks old means absolutely nothing.

tommyinorl said: "Blocked from viewing that URL in a browser?"
Yep - as stated previously.

tommyinorl said: "server has several IPs in its pool however they are all same range"
Are both servers in the same range though?

tommyinorl said: "it's illegal to block public data they either have to block everyone or nobody or they get sued, at least in the USA"
I'm in/from the UK so I can't say anything on the legality of this, but if they are providing access to US residential IP addresses (which they seem to be) they appear to be abiding by the regulations. They are likely to have few, if any, tax-paying US citizens needing to access the data from the UK, China, Russia, France etc., so why open the site up to potential abuse from those countries? And nobody "lives" in data-centres, so why allow your systems to be potentially "abused" by automated scripts, hacking attempts etc. from servers if you don't have to allow them access....
 

tommyinorl

That's public data and it's illegal for them to block it unless it's for abuse. The USA is different than the UK; that's why we went to war and won in 1776, I think :)
I don't think you all even have a sex offender registry do you?

We are allowed to use that data. I agree scraping is a grey area, but in the end we pay for it, so of course we can do what we wish with it, and none of the terms say we can't -- if they did they would be sued; at least that's what the courts say.