Scraping Web Page Data with JSOUP on Android - SIN-Android Study Notes

Scraping Web Page Data with JSOUP on Android

Introducing JSOUP

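JSOUP is a library for parsing HTML and XML. Both formats are built from tags, that is, names wrapped in angle brackets such as <title></title>.

JSOUP can select a given tag and read everything up to its matching closing tag, so for any page marked up with tags it can return the text enclosed by them.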
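As a minimal sketch of that idea, the snippet below parses an HTML string held in memory and reads the text enclosed by its <title> tag. The class name TitleTextExample and the sample HTML are made up for illustration; Jsoup.parse(String), select() and text() are the jsoup calls being shown.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TitleTextExample {
    // Returns the text enclosed by the matched <title> tag(s), or "" if there is none.
    static String titleText(String html) {
        Document doc = Jsoup.parse(html);      // parse an in-memory HTML string
        return doc.select("title").text();     // text between <title> and </title>
    }

    public static void main(String[] args) {
        // Hypothetical sample page, not taken from the site used later in this post.
        String html = "<html><head><title>Match results</title></head><body></body></html>";
        System.out.println(titleText(html));
    }
}
```

Parsing a string this way needs no network access, which makes it handy for trying out selectors before pointing the code at a real page.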

Before using it, download the JAR file from the JSOUP website and import it into the project as a library (lib):

http://jsoup.org/download

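The following example shows how it works.

It scrapes the results table in the middle of the page below. To see the tags, right-click the page and choose Inspect Element: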

http://sports.williamhill.com/bet/zh-hk/results///...

The example extracts the contents of these six tags:

    <tr class="rowGroup top bottom">
        <td style="padding-top: 0pt;" class="borderBottom" rowspan="1">賽事投注</td>
        <td></td>
        <td>烏克蘭 (2.75)  </td>
        <td style="padding-top: 0pt; text-align: center" class="borderBottom" rowspan="1">Y</td>
    </tr>
    <tr class="rowGroup top bottom">.....</tr>
    <tr class="rowGroup top bottom">.....</tr>
    <tr class="rowGroup top bottom">.....</tr>
    <tr class="rowGroup top bottom">.....</tr>
    <tr class="rowGroup top bottom">.....</tr>

Android example:

package com.example.jsoup;

import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import android.app.Activity;
import android.os.Bundle;
import android.widget.TextView;

public class MainActivity extends Activity {
    TextView t01;
    Thread th;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);
        t01 = (TextView) this.findViewById(R.id.t01);    // bind the TextView
        th = new Thread(r0);    // network access must not run on the UI thread
        th.start();             // start the worker thread
    }

    private Runnable r0 = new Runnable() {
        public void run() {
            try {
                URL url = new URL("http://sports.williamhill.com/bet/zh-hk/results///E/8386900/thisDate/2015/10/27/9:00:00//-England-%E5%B0%8D-%E7%83%8F%E5%85%8B%E8%98%AD.html");
                Document doc = Jsoup.parse(url, 3000);      // fetch and parse the page (3000 ms timeout)
                Elements rows = doc.select("tr[class]");    // every <tr> tag that has a class attribute
                for (Element row : rows) {                  // handle each selected row in turn
                    Elements cells = row.select("td");      // all <td> tags inside this row
                    if (cells.size() < 4) {
                        continue;                           // skip rows without the expected columns
                    }
                    // Only cells 0, 2 and 3 are used. Copying their text into final locals
                    // gives each UI task its own values, so no Thread.sleep() is needed to
                    // stop the worker from overwriting them before the UI thread displays them.
                    final String te01 = cells.get(0).text();
                    final String te02 = cells.get(2).text();
                    final String te03 = cells.get(3).text();
                    runOnUiThread(new Runnable() {          // hand the text to the UI thread for display
                        public void run() {
                            t01.append("\n" + te01 + "\n");
                            t01.append(te02 + "\n");
                            t01.append(te03 + "\n");
                        }
                    });
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    };
}
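The row-extraction logic can also be tried outside Android. The sketch below runs the same selectors against an HTML string instead of the live page. The class name RowExtractExample and the simplified English cell text are stand-ins, not content from the real site. The rows are wrapped in <table> because the HTML parser discards <tr> and <td> tags that appear outside a table.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class RowExtractExample {
    // Collects the text of cells 0, 2 and 3 from every <tr> that has a class attribute.
    static List<String[]> extractRows(String html) {
        List<String[]> result = new ArrayList<String[]>();
        Document doc = Jsoup.parse(html);
        for (Element row : doc.select("tr[class]")) {
            Elements cells = row.select("td");
            if (cells.size() < 4) {
                continue; // row does not have the expected four columns
            }
            result.add(new String[] {
                cells.get(0).text(), cells.get(2).text(), cells.get(3).text()
            });
        }
        return result;
    }

    public static void main(String[] args) {
        // Simplified stand-in for one row of the results table.
        String html = "<table>"
            + "<tr class=\"rowGroup top bottom\">"
            + "<td class=\"borderBottom\">Match bet</td><td></td>"
            + "<td>Ukraine (2.75)  </td><td class=\"borderBottom\">Y</td>"
            + "</tr></table>";
        for (String[] r : extractRows(html)) {
            System.out.println(r[0] + " | " + r[1] + " | " + r[2]);
        }
    }
}
```

Note that text() trims surrounding whitespace, so the trailing spaces in the third cell do not appear in the result.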

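Remember to add these permissions to AndroidManifest.xml. INTERNET is the one required to fetch pages; ACCESS_NETWORK_STATE is only needed if the app also checks the connection state: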

<uses-permission android:name="android.permission.ACCESS_NETWORK_STATE" />

<uses-permission android:name="android.permission.INTERNET"/>

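Selecting by tag name alone with select() often matches far too many tags, so you usually need to narrow the search with conditions. The patterns below use jsoup's selector syntax, which resembles CSS selectors (these are not regular expressions).

1. Selecting by attribute:

[class], tr[class]

On the page used above, the tag we want is the following one, which has a class attribute: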

<tr class="rowGroup top bottom">.....</tr>

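So the code has to be:

Document doc = Jsoup.parse(url, 3000);           // fetch and parse the page (3000 ms timeout)
Elements title = doc.select("tr[class]");        // every <tr> tag that has a class attribute

This matches every <tr> tag that has a class attribute.

You can also use [class] on its own, which casts a wider net: every tag of any kind that has a class attribute.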

Elements title = doc.select("[class]");
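To make the difference in scope concrete, here is a small sketch that counts the matches for both selectors on a made-up fragment. The class name AttributePresenceExample and the sample markup are assumptions for illustration only.

```java
import org.jsoup.Jsoup;

public class AttributePresenceExample {
    // Hypothetical fragment: the div, one tr and one td carry a class attribute.
    static final String HTML = "<div class=\"box\"><table>"
        + "<tr class=\"rowGroup top bottom\"><td class=\"borderBottom\">A</td></tr>"
        + "<tr><td>B</td></tr>"
        + "</table></div>";

    // Number of elements in the sample fragment matched by the selector.
    static int count(String selector) {
        return Jsoup.parse(HTML).select(selector).size();
    }

    public static void main(String[] args) {
        System.out.println("tr[class] -> " + count("tr[class]")); // only <tr> tags with a class
        System.out.println("[class]   -> " + count("[class]"));   // any tag with a class
    }
}
```

On this fragment tr[class] matches one element, while [class] also picks up the div and the td.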

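2. Selecting by attribute value:

[class=value], [class^=value], [class$=value], [class*=value], tr[class=value], tr[class^=value], tr[class$=value], tr[class*=value]

As the previous approach shows, selecting by attribute alone can still match a lot of tags, so the next step is to constrain the attribute's value.

Using the same tag as before: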

<tr class="rowGroup top bottom">.....</tr>

a. Attribute value matches exactly: [class=value], tr[class=value]

Elements title = doc.select("tr[class=rowGroup top bottom]");    // every <tr> whose class attribute is exactly "rowGroup top bottom"

b. Attribute value starts with: [class^=value], tr[class^=value]

Elements title = doc.select("tr[class^=rowGroup]");    // every <tr> whose class attribute starts with "rowGroup"

c. Attribute value ends with: [class$=value], tr[class$=value]

Elements title = doc.select("tr[class$=bottom]");    // every <tr> whose class attribute ends with "bottom"

d. Attribute value contains some text: [class*=value], tr[class*=value]

Elements title = doc.select("tr[class*=top]");        // every <tr> whose class attribute contains the text "top"
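The four value operators compare against the whole attribute string, so the order of the class names matters. The sketch below counts matches on a made-up table to show this. The class name AttributeValueExample and the sample rows are assumptions. It also includes the class selector tr.rowGroup, which is not covered above but is part of jsoup's selector syntax and matches a class name regardless of its position.

```java
import org.jsoup.Jsoup;

public class AttributeValueExample {
    // Hypothetical rows with different class strings.
    static final String HTML = "<table><tbody>"
        + "<tr class=\"rowGroup top bottom\"><td>1</td></tr>"
        + "<tr class=\"rowGroup top\"><td>2</td></tr>"
        + "<tr class=\"header bottom\"><td>3</td></tr>"
        + "<tr class=\"bottom rowGroup\"><td>4</td></tr>"
        + "</tbody></table>";

    // Number of elements in the sample table matched by the selector.
    static int count(String selector) {
        return Jsoup.parse(HTML).select(selector).size();
    }

    public static void main(String[] args) {
        String[] selectors = {
            "tr[class=rowGroup top bottom]", // exact match on the whole value
            "tr[class^=rowGroup]",           // value starts with "rowGroup"
            "tr[class$=bottom]",             // value ends with "bottom"
            "tr[class*=top]",                // value contains "top"
            "tr.rowGroup"                    // has the class name, in any position
        };
        for (String s : selectors) {
            System.out.println(s + " -> " + count(s));
        }
    }
}
```

Row 4 ("bottom rowGroup") is found by tr.rowGroup but not by tr[class^=rowGroup], which is worth keeping in mind when a site does not list its class names in a fixed order.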

3. Selecting by id:

#id, div#id

The tags above have no id, so here is one that does:

<div id="contentCenter" >.....</div>

Elements title = doc.select("div#contentCenter");    // the <div> tag whose id attribute is "contentCenter"
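A short sketch of id selection on a made-up fragment follows. The class name IdSelectorExample, the ids other than contentCenter, and the text content are assumptions for illustration.

```java
import org.jsoup.Jsoup;

public class IdSelectorExample {
    // Hypothetical fragment with three elements that carry an id.
    static final String HTML = "<div id=\"contentCenter\"><p>center</p></div>"
        + "<div id=\"contentLeft\"><p>left</p></div>"
        + "<span id=\"other\">x</span>";

    // Text of whatever the selector matches in the sample fragment ("" if nothing matches).
    static String textOf(String selector) {
        return Jsoup.parse(HTML).select(selector).text();
    }

    public static void main(String[] args) {
        System.out.println(textOf("div#contentCenter"));  // tag name and id must both match
        System.out.println(textOf("#contentLeft"));       // id alone, any tag name
        System.out.println(textOf("span#contentCenter")); // no match: the id is on a div
    }
}
```

When a selector matches nothing, select() returns an empty Elements collection rather than null, so text() simply yields an empty string.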

4. Selecting a child tag directly under a parent tag:

div>table

The markup below omits most of the page and serves as the example:

<div id="contentA">
    .....
    <table    .....>
        .....
        <tbody>
            .....
            <tr class="rowGroup top bottom">
                <td style="padding-top: 0pt;" class="borderBottom" rowspan="1">賽事投注</td>
                <td></td>
                <td>烏克蘭 (2.75)  </td>
                <td style="padding-top: 0pt; text-align: center" class="borderBottom" rowspan="1">Y</td>
            </tr>
            <tr class="rowGroup top bottom">.....</tr>
            <tr class="rowGroup top bottom">.....</tr>
            <tr class="rowGroup top bottom">.....</tr>
            <tr class="rowGroup top bottom">.....</tr>
            <tr class="rowGroup top bottom">.....</tr>
            .....
        </tbody>
        .....
    </table>
    .......
</div>

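// Start at the <div> whose id is "contentA", step to its direct child <table>, then that table's direct child <tbody>, and finally select every direct child <tr> of the tbody that has a class attribute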
Elements title = doc.select("div#contentA>table>tbody>tr[class]");
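The > combinator only follows direct parent-child steps, so skipping a level matches nothing. The sketch below contrasts it with the descendant combinator (a space), which jsoup also supports and which matches at any depth. The class name ChildCombinatorExample and the trimmed-down markup are assumptions for illustration.

```java
import org.jsoup.Jsoup;

public class ChildCombinatorExample {
    // Trimmed-down stand-in for the structure shown above.
    static final String HTML = "<div id=\"contentA\"><table><tbody>"
        + "<tr class=\"rowGroup top bottom\"><td>1</td></tr>"
        + "<tr class=\"rowGroup top bottom\"><td>2</td></tr>"
        + "<tr><td>3</td></tr>"
        + "</tbody></table></div>";

    // Number of elements in the sample markup matched by the selector.
    static int count(String selector) {
        return Jsoup.parse(HTML).select(selector).size();
    }

    public static void main(String[] args) {
        // Every level spelled out: matches the two rows that have a class.
        System.out.println(count("div#contentA>table>tbody>tr[class]"));
        // tr is not a direct child of the div, so nothing matches.
        System.out.println(count("div#contentA>tr[class]"));
        // Descendant combinator: any depth below the div.
        System.out.println(count("div#contentA tr[class]"));
    }
}
```

One thing to watch for: when parsing HTML, jsoup inserts a <tbody> element if the table markup omits it, so a chain written as table>tr may not match even when the page source shows the rows directly inside the table.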

5. Tags with a namespace

ns|tag

XML often contains namespaced tags, such as the following one used for the example:

<dc:subject>臺北市今明天氣預報</dc:subject>

Elements title = doc.select("dc|subject");    // select the <dc:subject> tag
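For XML input it is usually safer to tell jsoup to use its XML parser, so the document is not rearranged into an HTML html/head/body structure. The following is a minimal sketch under that assumption. The class name NamespaceTagExample and the sample feed item (with English stand-in text) are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class NamespaceTagExample {
    // Returns the text of the <dc:subject> tag(s) in the given XML string.
    static String subject(String xml) {
        // The empty string is the base URI; Parser.xmlParser() selects XML parsing.
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        return doc.select("dc|subject").text();   // "|" stands in for ":" in the selector
    }

    public static void main(String[] args) {
        // Hypothetical feed item; the text is an English stand-in.
        String xml = "<item xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
            + "<title>Forecast</title>"
            + "<dc:subject>Taipei weather forecast</dc:subject>"
            + "</item>";
        System.out.println(subject(xml));
    }
}
```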

These are the more commonly used selector patterns. I will add more detailed ones later once I have actually used them.

=====================================================================================================

Everything so far extracts the text between tags. This part shows how to get the URL held in a tag's src or href attribute, using attr().

<script type="text/javascript" src="http://whdn.williamhill.com/core/ob/static/cust/js/minified/main_racing.js?ver=723fa5ea119ca4e349d4e10120810d9a"></script>

Document doc = Jsoup.parse(url, 3000);        // fetch and parse the page (3000 ms timeout)
Elements title = doc.select("script[type=text/javascript]");
String te = title.attr("src");                // src value of the first matched tag that has a src attribute
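Two details of attribute access are easy to miss: calling attr() on an Elements collection returns the value from the first matched element that has the attribute, and relative URLs can be resolved with absUrl() when the document was parsed with a base URI. The sketch below illustrates both. The class name AttrExample, the sample markup and the base URI http://example.com/bet/ are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AttrExample {
    // Hypothetical markup: the first script is inline and has no src attribute.
    static final String HTML = "<script type=\"text/javascript\">var inline = 1;</script>"
        + "<script type=\"text/javascript\" src=\"/js/main.js?ver=1\"></script>"
        + "<a href=\"results.html\">Results</a>";

    static Document parse() {
        // The second argument is the base URI used to resolve relative URLs.
        return Jsoup.parse(HTML, "http://example.com/bet/");
    }

    // src of the first matched script that actually has a src attribute.
    static String firstScriptSrc() {
        return parse().select("script[type=text/javascript]").attr("src");
    }

    // href of the first link, resolved against the base URI.
    static String absoluteLink() {
        Element link = parse().select("a[href]").first();
        return link.absUrl("href");
    }

    public static void main(String[] args) {
        System.out.println(firstScriptSrc());
        System.out.println(absoluteLink());
    }
}
```

To read the attribute from every matched tag rather than just the first, loop over the Elements collection and call attr() on each Element.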

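Most of the above comes from the references below, turned into working examples after testing on other web pages. I will fill in the gaps later if I end up using this regularly.

References: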

http://a6350202.pixnet.net/blog/post/148400470-%E3...

http://a6350202.pixnet.net/blog/post/291643105

http://a6350202.pixnet.net/blog/post/294374704