Fetching Web Page Text Content in Java

2018/04/06

The core class is:

URLConnection

The code is as follows:

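import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public static String sendGet(String url, HashMap<String, String> requestHead) throws Exception {
    URLConnection connection = new URL(url).openConnection();
    connection.setRequestProperty("Accept", "*/*");
    connection.setRequestProperty("Connection", "Keep-Alive");

    // Apply any caller-supplied request headers.
    if (requestHead != null) {
        for (Map.Entry<String, String> header : requestHead.entrySet()) {
            connection.setRequestProperty(header.getKey(), header.getValue());
        }
    }

    // Read the whole response body into memory; try-with-resources closes the stream.
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    try (InputStream inputStream = connection.getInputStream()) {
        byte[] bytes = new byte[1024];
        int len;
        while ((len = inputStream.read(bytes)) != -1) {
            outputStream.write(bytes, 0, len);
        }
    }
    byte[] body = outputStream.toByteArray();

    // Decode once as ISO-8859-1 only to locate the charset declaration (it is ASCII),
    // then decode the same bytes again with the detected charset.
    String ret = new String(body, StandardCharsets.ISO_8859_1);
    String charset = getWebCharset(ret);
    return new String(body, charset);
}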

Here, getWebCharset automatically detects the page's encoding. Its code is as follows:

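public static String getWebCharset(String str) {
    String charset = "UTF";
    try {
        // TextUtil.getMiddleText is a helper that is not shown here; it extracts
        // the text between "charset=" and ">".
        String raw = TextUtil.getMiddleText(str, "charset=", ">");
        // Strip quotes before truncating, so that charset="gb2312" yields GB2 and not "GB.
        raw = raw.replaceAll("\"", "").replaceAll("'", "").trim();
        charset = raw.substring(0, Math.min(3, raw.length()));
    } catch (RuntimeException e) {
        // No usable charset declaration was found: keep the default.
    }
    charset = charset.toUpperCase();
    if (charset.startsWith("UT")) {
        return "UTF-8";
    } else if (charset.startsWith("GB2")) {
        return "GB2312";
    } else if (charset.startsWith("GBK")) {
        return "GBK";
    }
    // Unrecognized prefix: fall back to UTF-8 instead of returning an invalid charset name.
    return "UTF-8";
}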

Of course, there are many ways to detect the encoding, and you can implement your own.
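
As one possible alternative, here is a minimal sketch that uses a regular expression instead of the three-character prefix match. It looks for a charset declaration in the Content-Type response header first, then in the HTML itself, and only accepts names the JVM actually supports. The class name CharsetSniffer and the methods find and detect are hypothetical names chosen for illustration; they are not part of the original code or of any library.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class CharsetSniffer {
    // Matches charset=NAME with optional quotes, as in
    // "text/html; charset=GBK" or <meta charset="utf-8">.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the first declared charset that this JVM supports, or null if there is none.
    static Charset find(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        while (m.find()) {
            String name = m.group(1);
            try {
                if (Charset.isSupported(name)) {
                    return Charset.forName(name);
                }
            } catch (IllegalArgumentException e) {
                // Syntactically illegal charset name: ignore it and keep looking.
            }
        }
        return null;
    }

    // Prefers the HTTP header, then the HTML declaration, then falls back to UTF-8.
    static Charset detect(String contentType, String html) {
        Charset charset = find(contentType);
        if (charset == null) {
            charset = find(html);
        }
        return charset != null ? charset : StandardCharsets.UTF_8;
    }
}
```

Inside sendGet, this could be called as CharsetSniffer.detect(connection.getContentType(), ret), and the resulting Charset passed directly to new String(bytes, charset). Because unsupported names are filtered out, the final decoding step cannot fail on an unknown encoding.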
