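Overview
Parsing HTML in C with the ekhtml library
El-Kabong is a fast, non-validating HTML parser with a SAX-style interface. Its goal is to provide a fast, lightweight library for parsing HTML that can still recognize tags whose syntax is not correct.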
The source code can be downloaded from SourceForge: http://sourceforge.net/projects/ekhtml/
* Latest version: ekhtml-0.3.2.
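After downloading, unpack the archive, change into the extracted directory, and run ./configure && make to build the library. These steps are described in the INSTALL file, which is worth a look.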
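After the build, the .libs directory under src contains a file named libekhtml.a, which is the ekhtml static library. A static library can be thought of as a collection of .o files.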
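Finally, run sudo make install to install the ekhtml library on the machine. The header ekhtml.h goes into /usr/local/include, and the static and shared libraries go into /usr/local/lib.
To use the library, include the header ekhtml.h, and remember to add /usr/local/lib/libekhtml.a when linking.
Below is the official note on linking against the installed libraries: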
Libraries have been installed in:
/usr/local/lib
If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the `-LLIBDIR'
flag during linking and do at least one of the following:
- add LIBDIR to the `LD_LIBRARY_PATH' environment variable
during execution
- add LIBDIR to the `LD_RUN_PATH' environment variable
during linking
- use the `-Wl,--rpath -Wl,LIBDIR' linker flag
- have your system administrator add LIBDIR to `/etc/ld.so.conf'
See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
This is the official test program:
Compile it with: gcc -o tester tester.c /usr/local/lib/libekhtml.a
/*
* Copyright (c) 2002, Jon Travis
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
*
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
*
* THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
* IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
* OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
* IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
* INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
* NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
* DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
* THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
* THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include "ekhtml.h"

#define MAGIC_DOODIE 0xf9d33bc1

typedef struct {
    unsigned int n_starttags;
    unsigned int n_endtags;
    unsigned int n_comments;
    unsigned int n_data;
    unsigned int magic_doodie;
    unsigned int only_parse;
} tester_cbdata;
static void handle_starttag_way(void *cbdata, ekhtml_string_t *tag,
                                ekhtml_attr_t *attrs)
{
    printf("GOT WAY START!\n");
}
static void handle_starttag(void *cbdata, ekhtml_string_t *tag,
                            ekhtml_attr_t *attrs)
{
    ekhtml_attr_t *attr;
    tester_cbdata *tdata = cbdata;

    assert(tdata->magic_doodie == MAGIC_DOODIE);
    tdata->n_starttags++;
    if(tdata->only_parse)
        return;

    printf("START: \"%.*s\"\n", (int)tag->len, tag->str);
    for(attr=attrs; attr; attr=attr->next) {
        printf("ATTRIBUTE: \"%.*s\" = ", (int)attr->name.len, attr->name.str);
        if(!attr->isBoolean)
            printf("\"%.*s\"\n", (int)attr->val.len, attr->val.str);
        else
            printf("\"%.*s\"\n", (int)attr->name.len, attr->name.str);
    }
}
static void handle_endtag(void *cbdata, ekhtml_string_t *str){
    tester_cbdata *tdata = cbdata;

    assert(tdata->magic_doodie == MAGIC_DOODIE);
    tdata->n_endtags++;
    if(tdata->only_parse)
        return;

    printf("END: \"%.*s\"\n", (int)str->len, str->str);
}
static void handle_comment(void *cbdata, ekhtml_string_t *str){
    tester_cbdata *tdata = cbdata;

    assert(tdata->magic_doodie == MAGIC_DOODIE);
    tdata->n_comments++;
    if(tdata->only_parse)
        return;

    printf("COMMENT: \"%.*s\"\n", (int)str->len, str->str);
}
static void handle_data(void *cbdata, ekhtml_string_t *str){
    tester_cbdata *tdata = cbdata;

    assert(tdata->magic_doodie == MAGIC_DOODIE);
    tdata->n_data++;
    if(tdata->only_parse)
        return;

    fwrite(str->str, str->len, 1, stdout);
}
int main(int argc, char *argv[]){
    tester_cbdata cbdata;
    ekhtml_parser_t *ekparser;
    char *buf;
    size_t nbuf;
    int feedsize;

    if(argc < 2){
        fprintf(stderr, "Syntax: %s <feedsize> [1|0 (to print debug)]\n",
                argv[0]);
        return -1;
    }

    feedsize = atoi(argv[1]);

    ekparser = ekhtml_parser_new(NULL);

    cbdata.n_starttags = 0;
    cbdata.n_endtags = 0;
    cbdata.n_comments = 0;
    cbdata.n_data = 0;
    cbdata.magic_doodie = MAGIC_DOODIE;
    cbdata.only_parse = argc == 3;

    ekhtml_parser_datacb_set(ekparser, handle_data);
    ekhtml_parser_commentcb_set(ekparser, handle_comment);
    ekhtml_parser_startcb_add(ekparser, "WAY", handle_starttag_way);
    ekhtml_parser_startcb_add(ekparser, NULL, handle_starttag);
    ekhtml_parser_endcb_add(ekparser, NULL, handle_endtag);
    ekhtml_parser_cbdata_set(ekparser, &cbdata);

    buf = malloc(feedsize);
    while((nbuf = fread(buf, 1, feedsize, stdin))){
        ekhtml_string_t str;

        str.str = buf;
        str.len = nbuf;
        ekhtml_parser_feed(ekparser, &str);
        ekhtml_parser_flush(ekparser, 0);
    }
    ekhtml_parser_flush(ekparser, 1);
    ekhtml_parser_destroy(ekparser);
    free(buf);

    if(argc == 3){
        fprintf(stderr,
                "# starttags: %u\n"
                "# endtags: %u\n"
                "# comments: %u\n"
                "# data: %u\n", cbdata.n_starttags,
                cbdata.n_endtags, cbdata.n_comments, cbdata.n_data);
    }
    return 0;
}
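As the program shows, the ekhtml library is convenient to use: including the header ekhtml.h is all it takes to call its functions.
The following introduces the ekhtml functions (the official documentation is in the docs folder):
ekhtml has two structures. One is ekhtml_attr_t, which represents a tag attribute and is defined as follows: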
typedef struct ekhtml_attr_t {
    ekhtml_string_t name;        /**< Name of the attribute */
    ekhtml_string_t val;         /**< Value of the attribute */
    unsigned int isBoolean;      /**< True if the attribute is boolean */
    struct ekhtml_attr_t *next;  /**< Pointer to next attribute in the list */
} ekhtml_attr_t;
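The ekhtml_attr_t structure is passed to the callback functions. Both the attribute name and the attribute value are of type ekhtml_string_t, which is the other structure; according to the official documentation, for speed reasons it does not use '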