How to extract the full HTML source from the current page

Add-in Express™ Support Service

F André




Posts: 32
Joined: 2010-01-06
Hello,

I'm trying to extract the full HTML source code from the current page. I want to have the same result as when I do "View --> Source".

But these properties do not contain the same HTML; for example, the DOCTYPE declaration is missing:

HTMLDocument.documentElement.outerHTML
or
HTMLDocument.documentElement.innerHTML

Is there another property that holds the full HTML source code?

Thank you.


André
Posted 04 Aug, 2011 09:09:17
Sergey Grischenko


Add-in Express team


Posts: 7224
Joined: 2004-07-05
Hi André.

Did you try using the doctype property of the IHTMLDocument interface?
Posted 05 Aug, 2011 08:28:37
F André




Posts: 32
Joined: 2010-01-06
Sorry, I'm unable to reply: I get the error message "Message text." when I try to post.

Do you filter content? I have some links in the reply.
Posted 22 Aug, 2011 04:53:33
F André




Posts: 32
Joined: 2010-01-06
Hello Sergey,

In the end I used this framework for the task: csEXWB (http://tinyurl.com/yw3c25)

Concatenating the doctype does not give me the full HTML source that "View --> Source" shows.

It seems that P/Invoke calls are necessary to get the HTML source.

I use the cEXWB class like this:


cEXWB myWB = new cEXWB();
string htmlSource = myWB.GetSource(IEApp as IfacesEnumsStructsClasses.IWebBrowser2);


Now I only have string encoding problems, which I'll fix with this other framework: http://tinyurl.com/qlvx5l

Code extract from csEXWB :


public string GetSource(IWebBrowser2 thisBrowser)
{
    if ((thisBrowser == null) || (thisBrowser.Document == null))
        return string.Empty;

    // Declare vars
    int hr = Hresults.S_OK;
    IStream pStream = null;
    IPersistStreamInit pPersistStreamInit = null;

    // Query for IPersistStreamInit.
    pPersistStreamInit = thisBrowser.Document as IPersistStreamInit;
    if (pPersistStreamInit == null)
        return string.Empty;

    // Create stream, delete on release
    hr = WinApis.CreateStreamOnHGlobal(m_NullPointer, true, out pStream);
    if ((pStream == null) || (hr != Hresults.S_OK))
        return string.Empty;

    // Save the document into the stream
    hr = pPersistStreamInit.Save(pStream, false);
    if (hr != Hresults.S_OK)
        return string.Empty;

    // Now read from stream....

    // First get the size
    long ulSizeRequired = (long)0;
    // LARGE_INTEGER
    long liBeginning = (long)0;
    System.Runtime.InteropServices.ComTypes.STATSTG statstg = new System.Runtime.InteropServices.ComTypes.STATSTG();
    pStream.Seek(liBeginning, (int)tagSTREAM_SEEK.STREAM_SEEK_SET, m_NullPointer);
    pStream.Stat(out statstg, (int)tagSTATFLAG.STATFLAG_NONAME);

    // Size
    ulSizeRequired = statstg.cbSize;
    if (ulSizeRequired == (long)0)
        return string.Empty;

    // Allocate buffer + read
    byte[] pSource = new byte[ulSizeRequired];
    pStream.Read(pSource, (int)ulSizeRequired, m_NullPointer);

    // Added by schlaup to handle UTF8 and UNICODE pages
    // Convert
    //ASCIIEncoding asce = new ASCIIEncoding();
    //return asce.GetString(pSource);
    Encoding enc = null;
    // Number of leading bytes to skip; stays 0 unless a BOM was actually found
    int bomLength = 0;

    if (pSource.Length > 8)
    {
        // Check byte order mark
        if ((pSource[0] == 0xFF) && (pSource[1] == 0xFE)) // UTF16LE
        {
            enc = Encoding.Unicode;
            bomLength = 2;
        }
        else if ((pSource[0] == 0xFE) && (pSource[1] == 0xFF)) // UTF16BE
        {
            enc = Encoding.BigEndianUnicode;
            bomLength = 2;
        }
        else if ((pSource[0] == 0xEF) && (pSource[1] == 0xBB) && (pSource[2] == 0xBF)) // UTF8
        {
            enc = Encoding.UTF8;
            bomLength = 3;
        }
        else if ((pSource[1] == 0) && (pSource[3] == 0) && (pSource[5] == 0) && (pSource[7] == 0))
        {
            // Alternating zero bytes might indicate Unicode without a BOM,
            // so there is nothing to skip here
            enc = Encoding.Unicode;
        }
    }

    if (enc == null)
        enc = Encoding.Default;

    return enc.GetString(pSource, bomLength, pSource.Length - bomLength);
}
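
The byte order mark check at the end of GetSource can be tried on its own, without the browser or the COM stream. The following is only a minimal sketch: HtmlSourceDecoder, DetectBom and Decode are hypothetical names that are not part of csEXWB, the fallback encoding is passed in by the caller as an assumption, and the alternating-zero-bytes heuristic and UTF-32 are deliberately left out.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of csEXWB: decodes a raw byte buffer
// by looking only at a leading byte order mark (BOM).
public static class HtmlSourceDecoder
{
    // Returns the encoding implied by a leading BOM, or null when no BOM is found.
    // bomLength receives the number of BOM bytes the caller should skip.
    public static Encoding DetectBom(byte[] data, out int bomLength)
    {
        bomLength = 0;
        if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        {
            bomLength = 3;
            return Encoding.UTF8;
        }
        if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        {
            bomLength = 2;
            return Encoding.Unicode; // UTF-16 little endian
        }
        if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        {
            bomLength = 2;
            return Encoding.BigEndianUnicode; // UTF-16 big endian
        }
        return null;
    }

    // Decodes the buffer, skipping the BOM only when one was detected.
    // The fallback encoding is an assumption supplied by the caller.
    public static string Decode(byte[] data, Encoding fallback)
    {
        int bomLength;
        Encoding enc = DetectBom(data, out bomLength) ?? fallback;
        return enc.GetString(data, bomLength, data.Length - bomLength);
    }
}
```

For example, HtmlSourceDecoder.Decode(pSource, Encoding.Default) would take the place of the last lines of GetSource. Keeping the BOM length next to the detection means a buffer without a BOM is decoded from its first byte.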
Posted 22 Aug, 2011 05:03:53
Sergey Grischenko


Add-in Express team


Posts: 7224
Joined: 2004-07-05
Hi André,

Thank you very much for sharing your solution.
Posted 22 Aug, 2011 05:12:33