Crawler with DOMDocument does not work
A while ago I wrote a crawler that collects links (and image sources) and returns them in an array. I only get 2 links back, though: both are the URL I pass in, once with a trailing / and once without (the first one has it, the second one does not). Does anyone see what is wrong?
Code (php)
function get_urls($website) {
    $checked = array();
    $noncheck = array('http://'.$website.'/');
    $max_urls_to_check = 50;
    $filetypes = array('pdf','jpg','gif','png','doc','docx','xls','xlsx','ppt','pptx','xml','js','css'); // Any filetype which can be found which is not made up out of html
    $urls_checked = 0;
    while ((isset($noncheck[0])) && ($urls_checked <= $max_urls_to_check)) {
        $doc = new DOMDocument();
        $doc->loadHTMLFile($noncheck[0]);
        foreach ($doc->getElementsByTagName('a') as $link) {
            $href = $link->getAttribute('href');
            $href = preg_replace('/#*./','',$href);
            if (stripos($href, 'http://') !== false || stripos($href, 'https://') !== false) { // absolute link
                if (stripos($href, $website) === false || preg_match('/\.'.$filetypes.'$/',$href)) { // If the file is not an html file
                    $checked[] = $href; // If the link links to another website (domain)
                } else {
                    if ((!in_array($href,$checked)) && (!in_array($href,$noncheck))) { // Case Insensitive problems!
                        $noncheck[] = $href; // If the link links to the right domain
                    }
                }
            } else { // If the link is a relative link
                $href = str_replace('./','',$href);
                if (!(strpos($href, '/') == 0)) {
                    $href = '/'.$href;
                }
                $href = 'http://'.$website.$href;
                if ((!in_array($href,$checked)) && (!in_array($href,$noncheck))) { // Case Insensitive problems!
                    if (preg_match('/'.$filetypes.'$/',$href)) { // If the file is not an html file
                        $checked[] = $href;
                    } else {
                        $noncheck[] = $href;
                    }
                }
            }
        } // end foreach a links
        foreach ($doc->getElementsByTagName('img') as $img) {
            $src = $link->getAttribute('src');
            if (stripos($src, 'http://') !== false || stripos($src, 'https://') !== false) {
                if (!in_array($src,$checked)) {
                    $checked[] = $src;
                }
            } else { // If the link is a relative link
                $src = str_replace('./','',$src);
                if (!(strpos($src, '/') == 0)) {
                    $src = '/'.$src;
                }
                $src = 'http://'.$website.$src;
                if (!in_array($src,$checked)) {
                    $checked[] = $src;
                }
            }
        } // end foreach images
        $checked[] = $noncheck[0];
        array_shift($noncheck);
        $urls_checked += 1;
    } // end checking links
    return $checked;
}
$hoi = get_urls('google.com');
print_r($hoi);
Questions about how the code works are welcome (; This is the first time I am working with DOMDocument, so there is a good chance the mistake is in there, but I just do not see it anymore...
Thanks in advance for taking a look!
Edited on 06/02/2013 22:18:22 by Jyy An
Does nobody know the answer to this? :(
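A minimal sketch of what seems to cause the two-link result, assuming the posted code is complete. The pattern '/#*./' matches every single character (zero or more '#' followed by any character), so preg_replace() empties each href and every link collapses to 'http://' . $website. In addition, $filetypes is an array that is concatenated straight into the regex, which turns it into the literal text "Array". The helper names below (strip_fragment, has_skipped_extension, is_absolute_url) are hypothetical and only illustrate one possible fix.

```php
<?php
// Hypothetical helpers; the names are illustrative and not part of the posted code.

// Remove the fragment ('#' and everything after it) from a link.
function strip_fragment(string $href): string
{
    return (string) preg_replace('/#.*$/s', '', $href);
}

// True when the URL path ends in one of the given extensions.
function has_skipped_extension(string $url, array $filetypes): bool
{
    // Look only at the path, so a query string such as ?x=1 does not get in the way.
    $path = (string) parse_url($url, PHP_URL_PATH);
    // Build pdf|jpg|gif|... instead of concatenating the array itself.
    $alternatives = implode('|', array_map(function ($type) {
        return preg_quote($type, '/');
    }, $filetypes));
    return preg_match('/\.(?:' . $alternatives . ')$/i', $path) === 1;
}

// True only when the link starts with http:// or https://.
function is_absolute_url(string $href): bool
{
    return preg_match('#^https?://#i', $href) === 1;
}
```

Separately, the image loop in the posted code reads $link->getAttribute('src'), which is the last anchor from the previous loop; $img->getAttribute('src') is probably what was meant.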
Code (php)
PHP Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://google.com/, line: 43 in /home/ibreeden/tmp/kanweg.php on line 10
PHP Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://google.com/, line: 43 in /home/ibreeden/tmp/kanweg.php on line 10
PHP Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://google.com/, line: 43 in /home/ibreeden/tmp/kanweg.php on line 10
PHP Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://google.com/, line: 43 in /home/ibreeden/tmp/kanweg.php on line 10
PHP Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://google.com/, line: 43 in /home/ibreeden/tmp/kanweg.php on line 10
PHP Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://google.com/, line: 43 in /home/ibreeden/tmp/kanweg.php on line 10
PHP Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://google.com/, line: 43 in /home/ibreeden/tmp/kanweg.php on line 10
PHP Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://google.com/, line: 43 in /home/ibreeden/tmp/kanweg.php on line 10
PHP Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://google.com/, line: 43 in /home/ibreeden/tmp/kanweg.php on line 10
PHP Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://google.com/, line: 43 in /home/ibreeden/tmp/kanweg.php on line 10
PHP Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://google.com/, line: 43 in /home/ibreeden/tmp/kanweg.php on line 10
PHP Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://google.com/, line: 43 in /home/ibreeden/tmp/kanweg.php on line 10
PHP Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://google.com/, line: 43 in /home/ibreeden/tmp/kanweg.php on line 10
PHP Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: no name in http://google.com/, line: 43 in /home/ibreeden/tmp/kanweg.php on line 10
PHP Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://google.com/, line: 43 in /home/ibreeden/tmp/kanweg.php on line 10
PHP Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://google.com, line: 43 in /home/ibreeden/tmp/kanweg.php on line 10
PHP Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://google.com, line: 43 in /home/ibreeden/tmp/kanweg.php on line 10
PHP Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://google.com, line: 43 in /home/ibreeden/tmp/kanweg.php on line 10
PHP Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://google.com, line: 43 in /home/ibreeden/tmp/kanweg.php on line 10
PHP Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://google.com, line: 43 in /home/ibreeden/tmp/kanweg.php on line 10
PHP Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://google.com, line: 43 in /home/ibreeden/tmp/kanweg.php on line 10
PHP Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://google.com, line: 43 in /home/ibreeden/tmp/kanweg.php on line 10
PHP Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://google.com, line: 43 in /home/ibreeden/tmp/kanweg.php on line 10
PHP Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://google.com, line: 43 in /home/ibreeden/tmp/kanweg.php on line 10
PHP Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://google.com, line: 43 in /home/ibreeden/tmp/kanweg.php on line 10
PHP Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://google.com, line: 43 in /home/ibreeden/tmp/kanweg.php on line 10
PHP Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://google.com, line: 43 in /home/ibreeden/tmp/kanweg.php on line 10
PHP Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://google.com, line: 43 in /home/ibreeden/tmp/kanweg.php on line 10
PHP Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: no name in http://google.com, line: 43 in /home/ibreeden/tmp/kanweg.php on line 10
PHP Warning: DOMDocument::loadHTMLFile(): htmlParseEntityRef: expecting ';' in http://google.com, line: 43 in /home/ibreeden/tmp/kanweg.php on line 10
Array
(
[0] => http://google.com/
[1] => http://google.com
)
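The htmlParseEntityRef warnings above come from libxml's HTML parser and are typically triggered by a bare & in the fetched markup (for example in a URL like ...&tab=ln). They do not appear to stop the document from loading, since the script still runs to the end. A minimal sketch of how they could be collected instead of printed, using libxml_use_internal_errors(); the function names load_html_quietly and extract_hrefs are hypothetical, and the sketch parses a string with loadHTML() rather than fetching a URL.

```php
<?php
// Hypothetical helpers; the names are illustrative and not from the thread.

// Parse an HTML string without emitting PHP warnings.
// Returns [DOMDocument|null, array of LibXMLError objects].
function load_html_quietly(string $html): array
{
    // Buffer libxml errors internally instead of raising warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $ok = $doc->loadHTML($html);

    // Fetch the buffered errors, then clear the buffer and restore the old setting.
    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [$ok ? $doc : null, $errors];
}

// Collect the href attribute of every <a> element in the document.
function extract_hrefs(DOMDocument $doc): array
{
    $hrefs = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $hrefs[] = $anchor->getAttribute('href');
    }
    return $hrefs;
}
```

The same libxml_use_internal_errors(true) call can be placed before loadHTMLFile() in the crawler; the collected errors can then be inspected or simply discarded.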
Now there is just one more problem: when the document is empty, I want to be able to detect that. This does not work, however ;s
This is the code I had for it:
Code (php)
$urls = array('http://google.com', 'http://news.google.com/nwshp?hl=en&tab=ln', 'https://mail.google.com/mail/?tab=lm', 'https://drive.google.com/?tab=lo', 'http://dezewebsitebestaaaatnieet.net');
for ($i = 0; $i < count($urls); $i++) {
    $doccc = new DOMDocument();
    if ((!(@$doccc->loadHTMLFile($urls[$i]))) || ($doccc->saveHTML() == '')) {
        echo 'page (or entire website) is dead<br />';
    } else {
        echo 'page is alive or a custom 404 page<br />';
    }
}
It reports the non-existent website as alive too (even though http://dezewebsitebestaaaatnieet.net really does not exist ;o)...
Just to be completely clear, this is what I get now:
Quote:
page is alive or a custom 404 page
page is alive or a custom 404 page
page is alive or a custom 404 page
page is alive or a custom 404 page
page is alive or a custom 404 page
and this is what I want:
Quote:
page is alive or a custom 404 page
page is alive or a custom 404 page
page is alive or a custom 404 page
page is alive or a custom 404 page
page (or entire website) is dead
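One way to approach the dead-page check is to look at the HTTP status line instead of the parsed document. A minimal sketch, assuming get_headers() is allowed to open URLs on the server; the function names final_status_code, looks_dead and url_looks_dead are hypothetical. It cannot tell a custom 404 page that answers with status 200 from a live page, and it may still report "alive" for a non-existent domain if the DNS resolver in use answers unknown names with a landing page, which is one possible reason for the result described above.

```php
<?php
// Hypothetical helpers; the names are illustrative and not from the thread.

// Return the last HTTP status code in a header list as produced by
// get_headers(), or null when no status line is present.
function final_status_code(array $headers): ?int
{
    $code = null;
    foreach ($headers as $line) {
        // After redirects there are several status lines; keep the last one.
        if (is_string($line) && preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1];
        }
    }
    return $code;
}

// $headers is the return value of get_headers(): an array, or false on failure.
function looks_dead($headers): bool
{
    if ($headers === false) {
        return true; // no connection, DNS failure, and so on
    }
    $code = final_status_code($headers);
    return $code === null || $code >= 400;
}

// Convenience wrapper that performs the actual request.
function url_looks_dead(string $url): bool
{
    return looks_dead(@get_headers($url));
}
```

In the loop above, the condition could then become url_looks_dead($urls[$i]), leaving DOMDocument for the pages that are actually parsed.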