I recently switched my site from HTTP to HTTPS. To index all of its pages with the mnogosearch search engine, I need to run a script shipped with mnogosearch called "indexer", which retrieves every page of the website and indexes it into a MySQL table.
This "indexer" script must be run on the machine hosting the HTTP server, that is, on a virtual private server (VPS).
This script worked fine with the HTTP version of my site, but I have a problem with HTTPS indexing.
To index HTTPS pages, I use the "virtual scheme as an external retrieval system" feature described here: http://www.mnogosearch.org/doc/msearch-extended-indexing.html
It lets you use an external program to retrieve the contents of an HTTPS page.
I put the external command in a script called "curl.sh":
#!/bin/sh
wget -r --no-check-certificate "$1"
The problem is that the command "wget -r --no-check-certificate https://example.com/" works from my local machine (it downloads all the pages of my site "example.com"), but it does not work when I run it directly on the VPS that hosts my HTTPS server (example.com).
In the second case, it only downloads index.html.
Here is what I get when I run the recursive wget on the VPS:
$ wget -r --no-check-certificate https://example.com/
--2015-09-06 22:22:12-- https://example.com/
Résolution de example.com (example.com)...
Connexion vers example.com (example.com)...connecté.
Le propriétaire du certificat ne concorde pas avec le nom de l'hôte «example.com»
requête HTTP transmise, en attente de la réponse...200 OK
Longueur: 177 [text/html]
Sauvegarde en : «example.com/index.html»
100%[========================================================================================================================================>] 177 --.-K/s ds 0s
2015-09-06 22:22:12 (5,08 MB/s) - «example.com/index.html» sauvegardé [177/177]
FINISHED --2015-09-06 22:22:12--
Total wall clock time: 0,5s
Downloaded: 1 files, 177 in 0s (5,08 MB/s)
and the index.html it saves is not my site's page. Here are its contents:
<html><body><h1>It works!</h1>
<p>This is the default web page for this server.</p>
<p>The web server software is running but no content has been added, yet.</p>
</body></html>
Note that my HTTPS server listens on port 8443 (I added a rewrite rule that redirects HTTPS requests on port 443 to port 8443).
So I also tried:
wget -r --no-check-certificate https://example.com:8443/
But wget then ends with a 404 error:
$ wget -r --no-check-certificate https://example.com:8443/
--2015-09-06 22:39:03-- https://example.com:8443/
Résolution de example.com (example.com)...
Connexion vers example.com (example.com)||:8443...connecté.
requête HTTP transmise, en attente de la réponse...303 See Other
Emplacement: index.html [suivant]
--2015-09-06 22:39:04-- https://example.com:8443/index.html
Réutilisation de la connexion existante vers example.com:8443.
requête HTTP transmise, en attente de la réponse...200 OK
Longueur: 7389 (7,2K) [text/html]
Sauvegarde en : «example.com:8443/index.html»
100%[========================================================================================================================================>] 7 389 --.-K/s ds 0s
2015-09-06 22:39:04 (145 MB/s) - «example.com:8443/index.html» sauvegardé [7389/7389]
Chargement de robots.txt; svp ignorer les erreurs.
--2015-09-06 22:39:04-- https://example.com:8443/robots.txt
Réutilisation de la connexion existante vers example.com:8443.
requête HTTP transmise, en attente de la réponse...200 OK
Longueur: 138 [text/plain]
Sauvegarde en : «example.com:8443/robots.txt»
100%[========================================================================================================================================>] 138 --.-K/s
Setup, for reference: an Apache server, a Twisted server on port 8443, and port 443 redirected to 8443.
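To narrow down where the two ports diverge, it may help to look only at the response headers each one returns, once from the local machine and once from the VPS. The sketch below is one possible way to do that and is not taken from the mnogosearch documentation; the helper names build_url and probe are made up for illustration, and example.com stands in for the real host name.

```shell
#!/bin/sh
# Hypothetical helper: build the HTTPS URL for a host and an optional port.
# An empty port or 443 yields the default form without an explicit port.
build_url() {
    host=$1
    port=${2:-}
    if [ -z "$port" ] || [ "$port" = "443" ]; then
        printf 'https://%s/\n' "$host"
    else
        printf 'https://%s:%s/\n' "$host" "$port"
    fi
}

# Hypothetical helper: show the server's response headers for one URL.
# --spider checks the URL without saving it; -S prints the response headers.
probe() {
    wget --spider -S --no-check-certificate "$1" 2>&1
}

# Usage (run on the local machine and on the VPS, then compare the output):
#   probe "$(build_url example.com)"
#   probe "$(build_url example.com 8443)"
```

If the headers for port 443 differ between the two machines, that would suggest the request made from the VPS is not reaching the same server as the one made from outside.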