PowerShell search script that ignores binary files

undies888

#1

07-21-2023, 11:09 AM

I am really used to doing `grep -iIr` on the Unix shell but I haven't been able to get a PowerShell equivalent yet.

Basically, the above command searches the target folders recursively and ignores binary files because of the "-I" option. This option is also equivalent to the `--binary-files=without-match` option, which says *"treat binary files as not matching the search string"*

So far I have been using `Get-ChildItem -r | Select-String` as my PowerShell grep replacement, with the occasional `Where-Object` added. But I haven't figured out a way to ignore all binary files like the `grep -I` command does.

How can binary files be filtered or ignored with PowerShell?

So for a given path, I only want `Select-String` to search text files.

**EDIT:** A few more hours on Google produced this question [How to identify the contents of a file is ASCII or Binary][1]. The question says "ASCII" but I believe the writer meant "Text Encoded", like myself.

**EDIT:** It seems that an `isBinary()` function needs to be written to solve this issue. Probably a C# command-line utility to make it more useful.

**EDIT:** It seems that what `grep` does is check for an ASCII *NUL byte* or a UTF-8 *overlong* encoding. If either exists, it considers the file binary. This is a single *memchr()* call.

pretensionless162194

#2

07-21-2023, 11:26 AM

Ok, after a few more hours of research I believe I've found my solution. I won't mark this as the answer though.

[Pro Windows Powershell][1] had a very similar example. I had completely forgotten that I had this excellent reference. Please buy it if you are interested in PowerShell. It goes into detail on Get-Content and Unicode BOMs.

This [Answer][2] to a similar question was also very helpful with the Unicode identification.

Here is the script. Please let me know if you know of any issues it may have.

    # The file to be tested
    param ($currFile)

    # Name of the detected encoding; stays empty when no BOM is found
    $encoding = ""

    # Get the first 1024 bytes from the file
    # (-Encoding Byte is Windows PowerShell syntax; PowerShell 6+ uses -AsByteStream)
    $byteArray = @(Get-Content -Path $currFile -Encoding Byte -TotalCount 1024)

    # First four bytes as hex, two digits per byte, so that 0x00 becomes "00"
    # (a plain {0:X} format would print it as "0" and the BOM tests would never match)
    $bom = -join ($byteArray | Select-Object -First 4 | ForEach-Object { $_.ToString("X2") })

    if( $bom.StartsWith("EFBBBF") )
    {
        # Test for UTF-8 BOM
        $encoding = "UTF-8"
    }
    elseif( $bom -eq "FFFE0000" )
    {
        # Test for the UTF-32 (checked before UTF-16, which shares the FF FE prefix)
        $encoding = "UTF-32"
    }
    elseif( $bom -eq "0000FEFF" )
    {
        # Test for the UTF-32 Big Endian
        $encoding = "UTF-32 BE"
    }
    elseif( $bom.StartsWith("FFFE") )
    {
        # Test for the UTF-16
        $encoding = "UTF-16"
    }
    elseif( $bom.StartsWith("FEFF") )
    {
        # Test for the UTF-16 Big Endian
        $encoding = "UTF-16 BE"
    }

    if($encoding)
    {
        # File is text encoded
        return $false
    }

    # So now we're done with text encodings that commonly have '0's
    # in their byte streams. ASCII may have the NUL or '0' code in
    # its stream but that's rare apparently.

    # Both GNU Grep and Diff use variations of this heuristic

    if( $byteArray -contains 0 )
    {
        # Test for binary
        return $true
    }

    # This should be ASCII encoded
    $encoding = "ASCII"

    return $false

Save this script as *isBinary.ps1*.

This script correctly classified every text and binary file I tried.
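To plug the check into a search pipeline, the same BOM-then-NUL heuristic can be wrapped in a function and used as a filter. This is only a sketch under assumptions: the name `Test-BinaryFile` is made up for illustration and is not part of the script above, and the bytes are read through .NET rather than `Get-Content` so the same code runs on both Windows PowerShell and PowerShell 6+.

```powershell
# Hypothetical helper (the name is an assumption): returns $true when the
# file looks binary, using the BOM-then-NUL heuristic described above.
function Test-BinaryFile
{
    param ([Parameter(Mandatory = $true)][string]$Path)

    # Read at most the first 1024 bytes of the file
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $stream = [System.IO.File]::OpenRead($fullPath)
    try
    {
        $buffer = New-Object byte[] 1024
        $count = $stream.Read($buffer, 0, $buffer.Length)
    }
    finally
    {
        $stream.Dispose()
    }

    # Two-digit hex of the first (up to) four bytes, e.g. "EFBBBF3C"
    $head = ''
    for ($i = 0; $i -lt [Math]::Min(4, $count); $i++)
    {
        $head += $buffer[$i].ToString('X2')
    }

    # A Unicode BOM means text, even though UTF-16/UTF-32 contain NUL bytes
    # ("FFFE" also covers the UTF-32 little-endian BOM FF FE 00 00)
    if ($head -match '^(EFBBBF|FFFE|FEFF|0000FEFF)')
    {
        return $false
    }

    # No BOM: any NUL byte in the sample marks the file as binary
    for ($i = 0; $i -lt $count; $i++)
    {
        if ($buffer[$i] -eq 0) { return $true }
    }
    return $false
}
```

A search that skips binary files could then look like `Get-ChildItem -Recurse -File | Where-Object { -not (Test-BinaryFile $_.FullName) } | Select-String -Pattern 'foo'`, where `foo` is a placeholder pattern and `-File` assumes PowerShell 3.0 or later.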

keeperless324833

#3

07-21-2023, 11:32 AM

On Windows, file extensions are usually good enough:

    # 'ss' is not a built-in alias; define it (or use the built-in 'sls')
    Set-Alias ss Select-String

    # all C# and related files (projects, source control metadata, etc)
    dir -r -fil *.cs* | ss foo

    # exclude the binary types most likely to pollute your development workspace
    dir -r -exclude *.exe, *.dll, *.pdb | ss foo

    # stick the next three lines in your $profile (refining them over time)
    $bins = New-Object 'System.Collections.Generic.List[string]'
    $bins.AddRange( [string[]]@(".exe", ".dll", ".pdb", ".png", ".mdf", ".docx") )
    function IsBin([System.IO.FileInfo]$item) { $bins.Contains($item.Extension.ToLower()) }
    dir -r | ? { !$_.PSIsContainer -and !(IsBin $_) } | ss foo

But of course, file extensions are not perfect. Nobody likes typing long lists, and plenty of files are misnamed anyway.

I don't think Unix has any special binary vs text indicators in the filesystem. (Well, VMS did, but I doubt that's the source of your grep habits.) I looked at the implementation of Grep -I, and apparently it's just a quick-n-dirty heuristic based on the first chunk of the file. Turns out that's a strategy I have [a bit of experience][1] with. So here's my advice on choosing a heuristic function that is appropriate for Windows text files:

* Examine at least 1KB of the file. Lots of file formats begin with a header that looks like text but will bust your parser shortly afterward. The way modern hardware works, reading 50 bytes has roughly the same I/O overhead as reading 4KB.
* If you only care about straight ASCII, exit as soon as you see something outside the character range [31-127 plus CR and LF]. You might accidentally exclude some clever ASCII art, but trying to separate those cases from binary junk is nontrivial.
* If you want to handle Unicode text, let MS libraries handle the dirty work. It's harder than you think. From PowerShell you can easily access the [IMultiLang2 interface][2] (COM) or [Encoding.GetEncoding][3] static method (.NET). Of course, they are still just guessing. Raymond's comments on the [Notepad detection algorithm][4] (and the link within to Michael Kaplan) are worth reviewing before deciding exactly how you want to mix & match the platform-provided libraries.
* If the outcome is important (i.e. a flaw will do something worse than just clutter up your grep console), then don't be afraid to hard-code some file extensions for the sake of accuracy. For example, *.PDF files occasionally have several KB of text at the front despite being a binary format, leading to the notorious bugs linked above. Similarly, if you have a file extension that is likely to contain XML or XML-like data, you might try a detection scheme similar to [Visual Studio's HTML editor][5]. (SourceSafe 2005 actually borrows this algorithm for some cases.)
* Whatever else happens, have a reasonable backup plan.

As an example, here's the quick ASCII detector:

    function IsAscii([System.IO.FileInfo]$item)
    {
        begin
        {
            # Bytes accepted as text: LF, CR, and the range 31-127
            $validList = New-Object 'System.Collections.Generic.List[byte]'
            $validList.AddRange([byte[]] (10,13) )
            $validList.AddRange([byte[]] (31..127) )
        }

        process
        {
            try
            {
                # Open read-only so read-only or shared files don't fail
                $reader = $item.OpenRead()
                $bytes = new-object byte[] 1024
                $numRead = $reader.Read($bytes, 0, $bytes.Count)

                # Reject the file at the first byte outside the valid list
                for($i=0; $i -lt $numRead; ++$i)
                {
                    if (!$validList.Contains($bytes[$i]))
                        { return $false }
                }
                $true
            }
            finally
            {
                if ($reader)
                    { $reader.Dispose() }
            }
        }
    }

The usage pattern I'm targeting is a where-object clause inserted in the pipeline between "dir" and "ss". There are other ways, depending on your scripting style.

Improving the detection algorithm along one of the suggested paths is left to the reader.
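One possible improvement along the "let the libraries do it" path, offered as a sketch rather than as part of the original answer: have .NET validate the bytes as UTF-8 by decoding them with a `System.Text.UTF8Encoding` created with `throwOnInvalidBytes` set to true. The function name `Test-Utf8Text` is an assumption. It reads the whole file so that no multi-byte sequence is cut at a buffer boundary, which is only reasonable for small files, and it does not recognise UTF-16 or UTF-32 text.

```powershell
# Hypothetical helper (the name is an assumption): $true when the file's
# bytes contain no NUL and decode cleanly as UTF-8. ASCII is a subset of
# UTF-8, so plain ASCII files pass as well.
function Test-Utf8Text
{
    param ([Parameter(Mandatory = $true)][System.IO.FileInfo]$Item)

    # Whole file is read so that no multi-byte sequence is cut in half;
    # acceptable for a sketch, wasteful for very large files
    $bytes = [System.IO.File]::ReadAllBytes($Item.FullName)

    # NUL is valid UTF-8 but practically never appears in real text
    if ($bytes -contains 0) { return $false }

    # Second constructor argument: throw on invalid byte sequences
    $strictUtf8 = New-Object System.Text.UTF8Encoding($false, $true)
    try
    {
        [void]$strictUtf8.GetString($bytes)
        return $true
    }
    catch
    {
        # GetString raised a decoding error, so this is not valid UTF-8
        return $false
    }
}
```

Used as the where-object clause described above, this might read `dir -r | ? { !$_.PSIsContainer -and (Test-Utf8Text $_) } | Select-String foo`.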

edit: I started replying to your comment in a comment of my own, but it got too long...

Above, I looked at the problem from the POV of whitelisting known-good sequences. In the application I maintained, incorrectly storing a binary as text had far worse consequences than vice versa. The same is true for scenarios where you are choosing which FTP transfer mode to use, or what kind of MIME encoding to send to an email server, etc.

In other scenarios, blacklisting the obviously bogus and allowing everything else to be called text is an equally valid technique. While U+0000 is a valid code point, it's pretty much never found in real world text. Meanwhile, \00 is quite common in structured binary files (namely, whenever a fixed-byte-length field needs padding), so it makes a great simple blacklist. VSS 6.0 used this check alone and did ok.

Aside: *.zip files are a case where checking for \0 is riskier. Unlike most binaries, their structured "header" (footer?) block is at the end, not the beginning. Assuming ideal entropy compression, the chance of no \0 in the first 1KB is (1-1/256)^1024 or about 2%. Luckily, simply scanning the rest of the 4KB cluster NTFS read will drive the risk down to 0.00001% without having to change the algorithm or write another special case.
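The two percentages can be checked with a quick calculation. It assumes, as the paragraph above does, that well-compressed bytes behave like independent, uniformly random values, so each byte has a 255/256 chance of not being \0. The variable names are only for illustration.

```powershell
# Chance that a run of uniformly random bytes contains no 0x00 byte at all
$noNulIn1KB = [math]::Pow(1 - 1/256, 1024)
$noNulIn4KB = [math]::Pow(1 - 1/256, 4096)

# Print both probabilities as raw fractions
$noNulIn1KB
$noNulIn4KB
```

The first value comes out at roughly 0.018 (about 2%) and the second at roughly 0.0000001 (about 0.00001%), which matches the figures quoted.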

To exclude invalid UTF-8, add \C0-C1 and \F8-FD and \FE-FF (once you've seeked past the possible BOM) to the blacklist. Very incomplete since you're not actually validating the sequences, but close enough for your purposes. If you want to get any fancier than this, it's time to call one of the platform libraries like IMultiLang2::DetectInputCodepage.
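A minimal sketch of that blacklist, assuming the caller has already read a sample of bytes and skipped any BOM (a UTF-16 BOM such as FF FE would otherwise trip the check). The function name `Test-HasBlacklistedByte` is hypothetical, and as noted the check looks at single bytes only, without validating sequences.

```powershell
# Hypothetical helper (the name is an assumption): $true if the sample
# contains a blacklisted byte, meaning NUL or one of the bytes listed
# above (0xC0-0xC1 and 0xF8-0xFF). Byte sequences are not validated.
function Test-HasBlacklistedByte
{
    param ([byte[]]$Sample)

    foreach ($b in $Sample)
    {
        if ($b -eq 0 -or
            ($b -ge 0xC0 -and $b -le 0xC1) -or
            $b -ge 0xF8)
        {
            return $true
        }
    }
    return $false
}
```

For instance, `Test-HasBlacklistedByte ([byte[]](0x4D, 0x5A, 0x00))` reports a hit because of the NUL, while the two-byte sequence 0xC8 0x80 passes.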

Not sure why \C8 (200 decimal) is on Grep's list. It's not an overlong encoding. For example, the sequence \C8 \80 represents Ȁ (U+0200). Maybe something specific to Unix.

meatoscopy654979

#4

07-21-2023, 11:40 AM

I agree that the other answers are more 'complete', but because I do not know what file extensions I will encounter within a folder and I want to look through them all, this is the easiest solution for me.
How about, instead of avoiding binary files, you just ignore the errors that you get from searching through them?
It doesn't take long to run a search even if there are binary files within the folder being searched.
In the end, all you care about is the strings that match the pattern, and there is next to no chance of finding such a string inside a binary file.

    GCI -Recurse -Force -ErrorAction SilentlyContinue | ForEach-Object { GC -LiteralPath $_.FullName -ErrorAction SilentlyContinue | Select-String -Pattern "Pattern" } | Out-File -FilePath C:\temp\grep.txt -Width 999999
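If the file name and line number of each match are wanted as well, a variation on the same idea is to hand the files themselves to `Select-String` instead of their content. This is a sketch, not the poster's command: the function name `Search-Tree` and its parameters are made up for illustration, and `-File` assumes PowerShell 3.0 or later.

```powershell
# Hypothetical wrapper (name and parameters are assumptions): searches every
# file under $Root, binary or not, and silently skips files that cannot be read.
function Search-Tree
{
    param ([string]$Root = '.', [string]$Pattern = 'Pattern')

    # Select-String receives FileInfo objects, so each match carries
    # the path and line number of the file it came from
    Get-ChildItem -LiteralPath $Root -Recurse -Force -File -ErrorAction SilentlyContinue |
        Select-String -Pattern $Pattern -ErrorAction SilentlyContinue |
        Select-Object Path, LineNumber, Line
}
```

The result can still be piped to `Out-File` as in the one-liner above.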
