Ben's

The Art of Deletion

Linux

Ben Ko (SINCE 2013) 2013. 1. 21. 17:06

http://www.pc-freak.net/blog/how-to-delete-million-of-files-on-busy-linux-servers-work-out-argument-list-too-long/

How to delete millions of files on busy Linux servers (working around "Argument list too long")
How to delete a million or many thousands of files in the same directory on GNU/Linux and FreeBSD

If you try to delete more than 131,072 files on Linux with rm -f *, where the files are all stored in the same directory, you will get an error:

/bin/rm: Argument list too long.
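The exact threshold depends on the kernel's ARG_MAX limit, which caps the total size of arguments passed to execve(). You can check it yourself (a quick sanity check of my own, not from the original article):

```shell
# Maximum combined size (in bytes) of command-line arguments plus the
# environment that a single execve() call will accept on this system:
getconf ARG_MAX

# Rough count of how many entries the shell's * would try to pass to rm:
ls -1 | wc -l
```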


I've blogged earlier about deleting multiple files on Linux and FreeBSD, and this is not the first time I've faced this error.
Anyway, as time has passed, I've found a few other ways to delete large multitudes of files from a server.

In this article, I will briefly explain a few approaches to deleting a few million obsolete files to free up some space on your server.
Here are several methods you can use to clean out your tons of junk files.

1. Using Linux find command to wipe out millions of files

a.) Finding and deleting files using find's -exec switch:

# find . -type f -exec rm -fv {} \;


This method works fine, but it has one downside: file deletion is slow, because an external rm command is invoked for each file found.

For half a million files or more, this method will take a long time. However, from the point of view of hard disk stress it is not so bad, as the deletion does not put too much strain on the server's hard disk.
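A middle ground, not mentioned in the original article but part of standard POSIX find, is the + terminator: instead of forking rm once per file, find packs as many pathnames as fit into each rm invocation:

```shell
# One rm process handles a whole batch of pathnames, so the number of
# fork/exec calls drops dramatically compared to "-exec rm -f {} \;":
find . -type f -exec rm -f {} +
```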
b.) Finding and deleting big number of files with find's -delete argument:

Luckily, there is a better way to delete the files, using find's built-in -delete action:

# find . -type f -print -delete


c.) Deleting and printing out deleted files with find's -print argument

If you would like to see on your terminal which files find is deleting in "real time", add -print:

# find . -type f -print -delete


To prevent your server's hard disk from being stressed, and hence to save yourself from "outages" in normal server operation, it is good to combine the find command with ionice, e.g.:

# ionice -c 3 find . -type f -print -delete


Just note that ionice cannot guarantee that find's operations will not severely affect hard disk I/O requests. On heavily busy servers with high amounts of disk I/O writes, applying ionice may still not prevent the server from hanging! Be sure to always keep an eye on the server while deleting the files, no matter whether you use ionice or not. If, during find's execution, the server lags in serving its ordinary client requests, stop the command immediately by killing it from another SSH session or tty (if you are physically at the server).
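To "keep an eye on the server" as suggested, you don't need any extra tools; reading /proc/loadavg from another session is a dependency-free way to watch the load while the deletion runs (my own suggestion, not from the original post):

```shell
# Print the 1-, 5- and 15-minute load averages; if the 1-minute value
# climbs well above your CPU count, consider killing the find command.
awk '{print "load averages:", $1, $2, $3}' /proc/loadavg
```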

2. Using a simple bash loop with rm command to delete "tons" of files

An alternative way is to use a bash loop, to print each of the files in the directory and issue /bin/rm on each of the loop elements (files) like so:

for i in *; do
    rm -f "$i";
done


If you'd like to print what you are deleting, add an echo to the loop:

# for i in *; do
    echo "Deleting : $i"; rm -f "$i";
done


The bash loop worked like a charm in my case, so I warmly recommend this method whenever you need to delete more than 500,000 files in a directory.
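Another option in the same spirit, not covered above, is to stream the file list to rm through xargs. Like find's "+" terminator it batches arguments, but it also lets you cap the batch size, and the -print0/-0 pair keeps filenames containing spaces or newlines safe:

```shell
# Delete the files in the current directory in batches of 1000 per rm call:
find . -maxdepth 1 -type f -print0 | xargs -0 -n 1000 rm -f
```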

3. Deleting multiple files with perl

Deleting multiple files with Perl is not a bad idea at all.
Here is a Perl one-liner to delete all files contained within a directory:

# perl -e 'for(<*>){((stat)[9]<(unlink))}'


If you prefer a more human-readable Perl script to delete a multitude of files, use delete_multple_files_in_dir_perl.pl

Using the Perl interpreter to delete thousands of files is quick, really, really quick.
I did not benchmark on the server exactly how quick it is, but I guess the deletion rate should be similar to the find command's. It's possible that in some cases the Perl loop is even quicker …

4. Using PHP script to delete a multiple files

Using a short PHP script to delete files one by one in a loop, similar to the bash script above, is another option.
To do the deletion with PHP, use this little PHP script:

<?php
$dir = "/path/to/dir/with/files";
$dh = opendir($dir);
$i = 0;
while (($file = readdir($dh)) !== false) {
    $file = "$dir/$file";
    if (is_file($file)) {
        unlink($file);
        if (!(++$i % 1000)) {
            echo "$i files removed\n";
        }
    }
}
?>


As you can see, the script reads the directory defined in $dir and loops through it, deleting one file per loop iteration.
You should already know PHP is slow, so this method is only useful if you have to delete many thousands of files on a shared hosting server with no (SSH) shell access.

This PHP script is taken from Steve Kamerman's blog. I would also like to express my big gratitude to Steve for writing such a wonderful post. His post actually became the inspiration for this article.

You can also download the sample PHP delete-millions-of-files script here.

To use it, rename delete_millioon_of_files_in_a_dir.php.txt to delete_millioon_of_files_in_a_dir.php and run it through a browser.

Note that you might need to run it multiple times, because many shared hosting servers are configured to kill a PHP script that keeps running for too long.
Alternatively, the script can be run through the shell with the PHP CLI:

php delete_millioon_of_files_in_a_dir.php

5. So what is the "best" way to delete a million files on Linux?

To find out which method is quicker in terms of execution time, I did a home-brew benchmark on my ThinkPad notebook.

a) Creating 509072 sample files

Again, I used a bash loop to create many thousands of files for the benchmark.
I didn't want to put this load on a production server, hence I used my own notebook to conduct the benchmarks. As my notebook is not a server, the benchmarks might be partially inaccurate, but I believe they are still a pretty good indicator of which deletion method is better.

hipo@noah:~$ mkdir /tmp/test
hipo@noah:~$ cd /tmp/test
hipo@noah:/tmp/test$ for i in $(seq 1 509072); do echo aaaa >> $i.txt; done


I had to wait a few minutes until I had 509072 files at hand. As you can see, each of the files contains the sample string "aaaa".

b) Calculating the number of files in the directory

Once the command completed, to make sure all 509072 files existed, I used a find + wc command to count the number of files the directory contained:

hipo@noah:/tmp/test$ time find . -maxdepth 1 -type f | wc -l
509072

real 0m1.886s
user 0m0.440s
sys 0m1.332s


It's interesting that using the ls command to count the files is less efficient than using find:

hipo@noah:/tmp/test$ time ls -1 |wc -l
509072

real 0m3.355s
user 0m2.696s
sys 0m0.528s
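Much of the gap is explained by the fact that ls sorts its output by default, which costs O(n log n) over half a million names. With GNU ls, -f disables sorting (and implies -a, so the count includes the . and .. entries); in my experience this closes most of the distance to find (an assumption on my part, as the original post did not test it):

```shell
# -f: list entries in directory order, without sorting; implies -a,
# so subtract 2 from the count for the "." and ".." entries.
ls -1 -f | wc -l
```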


c) Benchmarking the different file-deletion methods with time

- Testing delete speed of find

hipo@noah:/tmp/test$ time find . -maxdepth 1 -type f -delete
real 15m40.853s
user 0m0.908s
sys 0m22.357s


You see, using find to delete the files is neither too slow nor lightning quick.

- How fast is the Perl loop at multitude file deletion?

hipo@noah:/tmp/test$ time perl -e 'for(<*>){((stat)[9]<(unlink))}'

real 6m24.669s
user 0m2.980s
sys 0m22.673s

Deleting my 509072 sample files took 6 minutes and 24 seconds. This is roughly two and a half times faster than find! Go-go Perl :)
As you can see from the results, Perl is a great, time-saving way to delete 500,000 files.

- The approximate deletion rate of the for + rm bash loop

hipo@noah:/tmp/test$ time for i in *; do rm -f $i; done

real 206m15.081s
user 2m38.954s
sys 195m38.182s


You see, the execution took 206 minutes and 15 seconds of real time, i.e. nearly 3.5 HOURS!!! This is extremely slow! But it works like a charm, as running the deletion didn't impact my normal laptop use. While the script was running I was mostly browsing through a few light (non-Flash) websites and doing some other stuff in gnome-terminal :)

As you can imagine, running a bash loop is a bit CPU intensive, but it puts less stress on the hard disk's read/write operations. Therefore it's clearly a good practice whenever the deletion of many files on a dedicated server is required.

d) My production server file-deleting experience

On a production server I tested only two of the listed methods. The production server where I tested is running Debian GNU/Linux Squeeze 6.0.3. There, I had the task of deleting a few million files.
The methods tried on the server were:

- The find . -type f -delete method

- for i in *; do rm -f $i; done

The results of the find -delete method were quite sad, as the server almost hung under the heavy hard disk load the command produced.

With the for loop all went smoothly. The files took a long, long time to delete (a few hours), but while it was running the server continued operating without interruptions.

While the bash loop was running, the server load average kept at a steady 4.
Taking my experience into account: if you're running a production server and you're still wondering which method to use to wipe some multitude of files, I would recommend you go the bash for loop + /bin/rm way. Yes, it is extremely slow, so expect it to run for a few hours, but it does not put too much extra load on the server.

Using the PHP script will probably be slow and inefficient compared to both find and the bash loop. I haven't given it a try yet, but I suppose it will be either equal in time or a few times slower than bash.

If you have tried the PHP script and have some observations, please drop a comment to tell me how it performs.

To sum it up:

Even though there are "hacks" to clean up a messy directory full of a few million junk files, such a directory should never have existed in the first place.

Frankly, keeping millions of files within the same directory is a very stupid idea.
Doing so will have a severe negative impact on your filesystem's directory-listing performance in the long term.
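If the application producing the files is under your control, the standard cure is to shard files across hashed subdirectories, so that no single directory ever accumulates millions of entries. A minimal sketch (the two-level scheme and the helper name are my own illustration, not from the original article):

```shell
# Map a filename to a two-level hashed subdirectory, e.g.
# "session_12345" becomes something like "ab/cd/session_12345",
# keeping every individual directory comfortably small.
shard_path() {
    h=$(printf '%s' "$1" | md5sum | cut -c1-4)
    printf '%s/%s/%s\n' "$(echo "$h" | cut -c1-2)" "$(echo "$h" | cut -c3-4)" "$1"
}
```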

If you know better (more efficient) ways to delete a multitude of files in a directory, please share them in the comments.



This entry was posted on Tuesday, March 20th, 2012 at 10:07 pm and is filed under Linux, Programming, System Administration.

One Response to “How to delete million of files on busy Linux servers (Work out Argument list too long)”
korziner says:
April 28, 2012 at 1:07 am
1) Why does the bash loop not stress the disk, compared to find? What about adding a sleep to the Perl loop?

2) > keeping millions of files within the same directory is very stupid

Is the following true: even deleted files in a directory reduce the ls performance of ext3. It reads inodes marked as free which previously were not free, and that takes time compared to virgin inodes.


Epoch time is also known as Unix time, Unix timestamp, or POSIX time.
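In other words, it counts the seconds elapsed since 1970-01-01 00:00:00 UTC. With GNU date (assuming GNU coreutils is available) you can convert in both directions; the sample timestamp below is arbitrary:

```shell
# Epoch seconds to a human-readable UTC date (GNU date syntax):
date -u -d @1031069374
# Current time as epoch seconds:
date +%s
```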


http://www.tutorialspoint.com/perl/perl_stat.htm

stat FILEHANDLE

stat EXPR

stat



Definition and Usage
Returns a 13-element array giving the status info for a file, specified by either FILEHANDLE, EXPR, or $_. The list of values returned is shown in the table below. If used in a scalar context, returns 0 on failure, 1 on success. Note that support for some of these elements is system dependent; check your system's documentation for a complete list.

Element Description
0 Device number of file system
1 Inode number
2 File mode (type and permissions)
3 Number of (hard) links to the file
4 Numeric user ID of file's owner
5 Numeric group ID of file's owner
6 The device identifier (special files only)
7 File size, in bytes
8 Last access time since the epoch
9 Last modify time since the epoch
10 Inode change time (not creation time!) since the epoch
11 Preferred block size for file system I/O
12 Actual number of blocks allocated



Return Value
ARRAY, ($device, $inode, $mode, $nlink, $uid, $gid, $rdev, $size, $atime, $mtime, $ctime, $blksize, $blocks)

Example
Try the following example:

#!/usr/bin/perl -w

($device, $inode, $mode, $nlink, $uid, $gid, $rdev, $size,
$atime, $mtime, $ctime, $blksize, $blocks) =
stat("/etc/passwd");

print("stat() $device, $inode, $ctime\n");



On my machine it produces the following result:

stat() 147, 20212116, 1177094582



http://www.emh.co.kr/xhtml/perl_file_directory.html

For reference, a lot of older Perl code does filename globbing with <> instead of glob, putting the pattern between the angle brackets. This style is easy to confuse with reading from a filehandle, so avoid using it if possible; just know it so you can read other people's code. It looks like this:

@files = </home/linuxer/*.mp3>;
@files = glob("/home/linuxer/*.mp3");

These two are the same thing.

my ($atime, $mtime) = (stat("index.html"))[8,9];
print "$atime\n$mtime\n";
print "$atime\n$mtime\n";

The point to note in this code: among the pieces of file information returned by stat(), the 9th and 10th are the access time and the modification time, which is why the slice is [8,9] (arrays count from 0, as mentioned before). One thing worth remarking on is that stat() is wrapped in an additional pair of parentheses. If you remove those parentheses, the code above produces an error. Why? Wrapping a value or variable in parentheses means treating it as a list; in other words, a parenthesized expression is in list context. In the case above, the values returned by stat() are treated as a list thanks to the extra parentheses, and as a result the (list)[8,9] slice form becomes possible. Got it? Wrapping in parentheses gives list context. Don't forget.

Back to the topic: if you run the code above, the output comes out something like this:

1031069374
1030907766

 
