I recently participated in a thread on Reddit that showed the approach Facebook takes to obfuscate its Sponsored posts so that they remain undecipherable to ad-blocking software. Facebook, quite understandably, goes to great lengths to make sure the posts with ads are not easily discoverable by scripts that rely on DOM manipulation to block ads.
I argue that anything relying on DOM structure is easily defeated the moment developers change the markup in their next update. Instead of deciphering the jumbled markup to understand which posts are ads, it would be beneficial to focus on attributes that can’t change.
There are at least two things that absolutely must not change in a Facebook post: the visible “Sponsored” text and its accessibility label. The first can be exploited by rendering each post div into a canvas, clipping out a region around the “Sponsored” text, running a quick OCR analysis, and hiding the divs that meet the criteria. Since that would take more than a few minutes to build, I chose to focus on the second approach: find the accessibility label and follow the trail.
We know that each post that contains ads is required by some mystical legal stuff to include the “Sponsored” text. Although the visible label is severely jumbled, the accessibility label that screen readers read out must come from somewhere that is not split like crazy; this also happens to be a requirement of some other mystical legal stuff.
So first, let’s find where the label is. From the markup we find the aria-labelledby attribute:
<span aria-labelledby="jsc_c_12" aria-label="label" class="gpro0wi8 j1lvzwm4 stjgntxs ni8dbmo4 q9uorilb">
Further id lookups reveal that the labels are contained in different SPANS, and I would bet that the ids of these are dynamically generated.
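To make that id lookup concrete, here is a minimal sketch of resolving an aria-labelledby reference to the text a screen reader would pick up. The helper name resolveLabelText is my own invention, `doc` stands in for the browser's document, and I am assuming the attribute holds a space-separated list of ids, which is how ARIA defines it.

```javascript
// Hypothetical helper: follow an element's aria-labelledby ids and
// return the combined text of the elements they point to.
function resolveLabelText(doc, element) {
  const refs = (element.getAttribute("aria-labelledby") || "")
    .split(/\s+/)
    .filter(Boolean); // drop empty entries from stray whitespace
  return refs
    .map((id) => doc.getElementById(id))
    .filter((node) => node !== null) // ignore ids that resolve to nothing
    .map((node) => node.textContent.trim())
    .join(" ");
}

// In a browser console: print the label text of the first labelled element.
if (typeof document !== "undefined") {
  const el = document.querySelector("[aria-labelledby]");
  if (el) console.log(resolveLabelText(document, el));
}
```

On a sponsored post this should hand back the unjumbled label text, which is the trail we want to follow.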
Once we’ve found the unaltered text, we can start following its trail.
Find the SPANS whose innerText contains Sponsored. We can derive this using an XPath query like so:
sp = document.evaluate("//span[contains(.,'Sponsored')]", document)
This gives us an XPathResult that iterates over the nodes matching the criteria. FB uses different labels for individual ads, so get the ids of these spans:
In my case I get "jsc_c_7, jsc_c_12, jsc_c_13".
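One way to pull those ids out of the XPath result, as a rough sketch: the function name collectLabelIds is mine, `doc` is assumed to behave like the browser's document, and I request a snapshot result (instead of the default iterator) so it can be indexed like a list.

```javascript
// Hypothetical helper: collect the ids of every <span> whose text
// contains the given label.
function collectLabelIds(doc, label = "Sponsored") {
  const SNAPSHOT = 7; // numeric value of XPathResult.ORDERED_NODE_SNAPSHOT_TYPE
  const result = doc.evaluate(
    `//span[contains(.,'${label}')]`, doc, null, SNAPSHOT, null
  );
  const ids = [];
  for (let i = 0; i < result.snapshotLength; i++) {
    const id = result.snapshotItem(i).id;
    if (id) ids.push(id); // wrapper spans without an id are skipped
  }
  return ids;
}

// In a browser console on the feed page:
if (typeof document !== "undefined") {
  console.log(collectLabelIds(document));
}
```

The query also matches outer spans that merely wrap the label text, which is why the sketch keeps only the spans that carry an id.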
Now find the DOM subtree for each item in the FB feed:
nodes = [...document.querySelectorAll("div[data-testid='Keycommand_wrapper_feed_story']")]
For each feed item node, check whether it contains an element whose aria-labelledby attribute matches the id of one of those SPANS from above:
ids = ["jsc_c_7", "jsc_c_12", "jsc_c_13"]
ads = nodes.filter((n) => ids.some((id) => n.querySelector(`[aria-labelledby='${id}']`) !== null))
And, voilà! We’ve located all posts that are actually ads in our Facebook feed.
This may sound simple, but the feed is rendered only within the viewport. Posts are rendered and removed continuously as we scroll, so anything that was previously hidden will come back. To make this effective, we could attach this snippet to the window scroll event to continuously hide ads as they show up within the viewport.
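Here is a rough sketch of that scroll hook. The names hideSponsoredPosts and throttle and the 250 ms interval are my own choices, not anything Facebook provides, and the selectors are the ones from my feed above, so treat them as assumptions that may well have changed by the time you read this.

```javascript
// Hypothetical helper: hide every feed post that references a "Sponsored"
// label span. `doc` is expected to behave like the browser's document.
function hideSponsoredPosts(doc) {
  const SNAPSHOT = 7; // numeric value of XPathResult.ORDERED_NODE_SNAPSHOT_TYPE
  const spans = doc.evaluate("//span[contains(.,'Sponsored')]", doc, null, SNAPSHOT, null);
  const ids = [];
  for (let i = 0; i < spans.snapshotLength; i++) {
    const id = spans.snapshotItem(i).id;
    if (id) ids.push(id);
  }
  const posts = [...doc.querySelectorAll("div[data-testid='Keycommand_wrapper_feed_story']")];
  const ads = posts.filter((post) =>
    ids.some((id) => post.querySelector(`[aria-labelledby='${id}']`) !== null)
  );
  ads.forEach((post) => { post.style.display = "none"; });
  return ads.length; // number of posts hidden in this pass
}

// Run `fn` at most once per `waitMs`, so a burst of scroll events
// triggers a single scan instead of hundreds.
function throttle(fn, waitMs) {
  let last = -Infinity;
  return (...args) => {
    const now = Date.now();
    if (now - last >= waitMs) {
      last = now;
      fn(...args);
    }
  };
}

// Only wire up the listener when running inside a browser page.
if (typeof window !== "undefined" && typeof document !== "undefined") {
  window.addEventListener(
    "scroll",
    throttle(() => hideSponsoredPosts(document), 250),
    { passive: true }
  );
}
```

The throttle fires on the leading edge, so a post rendered at the very end of a scroll is only caught on the next scroll event. A MutationObserver on the feed container would be another way to react to newly rendered posts, but I have not tried it here.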
For the sake of completeness, here are a few script nuggets for a couple of other similar websites:
[...document.querySelectorAll("div[class~=Post]")].filter((n) => document.evaluate(".//span[contains(.,'promoted')]", n).iterateNext() !== null).forEach((d) => d.remove());
[...document.querySelectorAll("div[data-id^='urn:li:activity:']")].filter((n) => document.evaluate(".//span[contains(.,'Promoted')]", n).iterateNext() !== null).forEach((d) => d.remove());