Documentation
HTML PARSER
System Requirements
Preliminaries
- Determine the path to Perl 5 on your web
server host. Note that some web hosting companies run both Perl 4 and Perl 5.
Make ABSOLUTELY sure you are not setting this up under Perl 4. Ask your
administrator if you are not sure.
- Unpack the tar archive on your desktop using a
program that can unpack UNIX tar archives. If you don't have such a program, download
WinZip free from SHAREWARE.COM.
- After you have unpacked the tar archive you
will have a collection of folders and files on your desktop. Now you have to do some
basic editing of these files (or at least some of them). Use a text editor
such as WordPad, Notepad, BBEdit, SimpleText, or TeachText to edit the files. These
are NOT WORD PROCESSOR DOCUMENTS; they are simple TEXT files, so don't save them as
word processor documents or add extensions such as .txt, or they will NOT WORK.
Note that some files inside the folders may be "blank".
This is normal.
Preparing the CGI scripts
Define Path To PERL 5
The first step is to open every
file that has a .cgi extension and edit line number one of each script. Each of the
CGI scripts is written in Perl 5. For your scripts to run, they must know where Perl 5 is
installed on your web server. The path to Perl 5 is given to a CGI script in the first
line of the file. In each of the CGI scripts the first line of code looks something like
this:
#!/usr/bin/perl
If the path to Perl 5 on your web server is
different from /usr/bin/perl, you must edit the first line of each CGI script to reflect
the correct path. If the path to Perl 5 is the same, no changes are necessary. If you do
not know the path to Perl 5, ask the webmaster or system administrator at your server site.
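As a rough sketch of the edit described above, the following Perl helper swaps the first line of a script's text for a new interpreter path. The name set_shebang and the example path /usr/local/bin/perl are assumptions made for illustration; neither comes from the distributed scripts.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the package): replace the first line
# of a script with a "#!" line that points at the given Perl path.
sub set_shebang {
    my ($script_text, $perl_path) = @_;
    # Only touch the very first line, and only if it already starts with "#!".
    $script_text =~ s/\A#![^\n]*/#!$perl_path/;
    return $script_text;
}

# /usr/local/bin/perl is an assumed path; use the one your server reports.
my $original = "#!/usr/bin/perl\nprint \"hello\\n\";\n";
my $updated  = set_shebang($original, '/usr/local/bin/perl');
print $updated;
```

Text that does not begin with a "#!" line is returned unchanged, so running the helper on a non-script file by mistake should do no harm.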
Configure the .cgi files
configure.cgi
Set variables inside of configure.cgi like so:
$subdir = "/full/path/to/yourdomain.com/htdocs";
$parsedb = "parsed.txt";
$webpagelist = "sites.txt";
$outputdescriptionlength = 350;
$pointsfortitlematch = 10;
$pointsformetadescriptionmatch = 10;
$pointsformetakeywordsmatch = 10;
$subdir is the same as $rooturl, without the trailing slash.
$parsedb is the output file that holds the parsed HTML, now turned into text.
$webpagelist is the full path to sites.txt.
$outputdescriptionlength is the length of each search result; 350 characters is pretty good.
$pointsfortitlematch affects search engine ranking: the more points given when the keyword is in the title, the higher the page ranks.
$pointsformetadescriptionmatch affects search engine ranking: the more points given when the keyword is in the meta description, the higher the page ranks.
$pointsformetakeywordsmatch affects search engine ranking: the more points given when the keyword is in the meta keywords, the higher the page ranks.
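The parser's actual ranking code is not shown in this document, so the following is only a sketch of how the three point settings might combine, assuming each field that contains the keyword adds its points to the page's score. The functions score_page and trim_description, the hash layout, and the case-insensitive substring match are all assumptions.

```perl
use strict;
use warnings;

# Values copied from the configure.cgi example.
my $outputdescriptionlength       = 350;
my $pointsfortitlematch           = 10;
my $pointsformetadescriptionmatch = 10;
my $pointsformetakeywordsmatch    = 10;

# Hypothetical scoring: add a field's points when the keyword appears in it.
sub score_page {
    my ($keyword, $page) = @_;
    my $kw    = lc($keyword);
    my $score = 0;
    $score += $pointsfortitlematch
        if index(lc($page->{title}), $kw) >= 0;
    $score += $pointsformetadescriptionmatch
        if index(lc($page->{metadesc}), $kw) >= 0;
    $score += $pointsformetakeywordsmatch
        if index(lc($page->{metakeywords}), $kw) >= 0;
    return $score;
}

# Hypothetical helper: cut a result description to the configured length.
sub trim_description {
    my ($text) = @_;
    return substr($text, 0, $outputdescriptionlength);
}

# Made-up sample page, for illustration only.
my %page = (
    title        => 'Perl HTML Parser',
    metadesc     => 'Converts web pages to text',
    metakeywords => 'perl, html, parser',
);
print 'Score: ', score_page('perl', \%page), "\n";
```

Under these assumptions, raising one of the three point values makes a match in that field count for more relative to the other two.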
Upload Your Edited CGI and Database Files
- Create a directory inside cgi-bin called
htmlparser and upload all files. chmod every file that ends in .cgi to 755, and everything
else to 666 or 777.
- Use sites.txt and spider.cgi to spider the
directories you want to parse.
- You can also just upload HTML files into the
cgi-bin/htmlparser directory and manually add the filenames (e.g. a.html, xyz.html,
index.html) to sites.txt. In the end, sites.txt will contain the full path to every HTML file
you are planning to convert into text.
- Run htmlparse.cgi from a telnet session to create the text
output inside the file named by $parsedb (parsed.txt in the example configuration).
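The permission rule from the first step can be sketched in Perl as follows. The helpers mode_for and set_modes are hypothetical names, and 0666 is picked here for the non-script files; the instructions above allow either 666 or 777.

```perl
use strict;
use warnings;

# Hypothetical helper: choose a permission mode from the file name.
sub mode_for {
    my ($filename) = @_;
    return $filename =~ /\.cgi\z/ ? 0755 : 0666;
}

# Hypothetical helper: apply those modes to every regular file in a
# directory and return the number of files changed.
sub set_modes {
    my ($dir) = @_;
    my $changed = 0;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    for my $name (readdir $dh) {
        my $path = "$dir/$name";
        next unless -f $path;    # skip ".", "..", and subdirectories
        $changed += chmod(mode_for($name), $path);
    }
    closedir $dh;
    return $changed;
}

# Show the mode each kind of file would receive.
printf "%s -> %04o\n", $_, mode_for($_) for qw(htmlparse.cgi sites.txt);
```

On the server, set_modes would be called with the path to your own cgi-bin/htmlparser directory. Most FTP programs can set the same permissions if you prefer not to run a script.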
Altering the output of the parser
- Find this line inside of htmlparse.cgi:
print PARSEDHTML "$title|$metadesc|$metakeywords|$newtext|$uniquewords|$webpages|$maxfrequency|$countwords|$countuniquewords|$lastmodified|$searchavailability\n";
- You can rearrange, edit, or remove
any fields you want. If you just want the unique keywords or just the raw text, remove
whatever fields you find useless.
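To make the record layout concrete, here is a minimal sketch of reading one line of the parser's output back into named fields. The field order follows the print PARSEDHTML line; the function parse_record and the sample line are made up for illustration, and the sketch assumes no field contains a "|" character.

```perl
use strict;
use warnings;

# Field order taken from the print PARSEDHTML line in htmlparse.cgi.
my @fields = qw(title metadesc metakeywords newtext uniquewords webpages
                maxfrequency countwords countuniquewords lastmodified
                searchavailability);

# Hypothetical reader: split one output line into a hash keyed by field name.
sub parse_record {
    my ($line) = @_;
    chomp $line;
    # A limit of -1 keeps trailing empty fields instead of dropping them.
    my @values = split /\|/, $line, -1;
    my %record;
    @record{@fields} = @values;
    return \%record;
}

# Made-up sample line, for illustration only.
my $sample = "My Page|A short description|perl,html|body text|body text|index.html|1|2|2|2001-01-01|yes\n";
my $rec = parse_record($sample);
print "$rec->{title} ($rec->{webpages})\n";
```

If you rearrange or remove fields in the print line, the @fields list in a reader like this has to be changed to match.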